Abstract: Let $s = s_1..s_n$ be a text (or sequence) on a finite alphabet $\Sigma$ of size $\sigma$. A
fingerprint in $s$ is the set of distinct characters appearing in one of its substrings. The
problem considered here is to compute the set $\mathcal{F}$ of all fingerprints of all substrings
of $s$ in order to answer efficiently certain questions on this set. A substring $s_i..s_j$ is a
maximal location for a fingerprint $f \in \mathcal{F}$ (denoted by $\langle i,j \rangle$) if the alphabet of $s_i..s_j$ is $f$
and $s_{i-1}$, $s_{j+1}$, if defined, are not in $f$. The set of maximal locations in $s$ is $\mathcal{L}$ (it is easy
to see that $|\mathcal{L}| \le n\sigma$). Two maximal locations $\langle i,j \rangle$ and $\langle k,l \rangle$ such that $s_i..s_j = s_k..s_l$
are named copies, and the quotient set of $\mathcal{L}$ according to the copy relation is denoted
by $\mathcal{L}_C$.

We first present new exact efficient algorithms and data structures for the following
three problems: (1) to compute $\mathcal{F}$; (2) given $f$ as a set of distinct characters in $\Sigma$, to
answer if $f$ represents a fingerprint in $\mathcal{F}$; (3) given $f$, to find all maximal locations
of $f$ in $s$. As in papers on succinct data structures, all space complexities in this paper
are counted in bits. Problem 1 is solved either in $O(n + |\mathcal{L}_C| \log\sigma)$
worst-case time (in this paper all logarithms are base two logarithms) using
$O((n + |\mathcal{L}_C| + |\mathcal{F}| \log\sigma) \log n)$ bits of space, or in $O(n + |\mathcal{L}| \log\sigma)$ randomized expected
time using $O((n + |\mathcal{F}| \log\sigma) \log n)$ bits of space. Problem 2 is solved either in $O(|f|)$
expected time if only $O(|f| \log n)$ bits of working space for queries is allowed, or in
worst-case $O(|f|/\epsilon)$ time if a working space of $O(\sigma^\epsilon \log n)$ bits is allowed (with $\epsilon$ a
constant satisfying $0 < \epsilon < 1$). These algorithms use a data structure that occupies
$|\mathcal{F}|(2\log\sigma + \log_2 e)(1 + o(1))$ bits. Problem 3 is solved with the same time complexity
as Problem 2, but with the addition of an $occ$ term to each of the complexities, where
$occ$ is the number of maximal locations corresponding to the given fingerprint. Our
solution of this last problem requires a data structure that occupies $O((n + |\mathcal{L}_C|) \log n)$
bits of memory.

In the second part of our paper we present a novel Monte Carlo approximate
construction approach. Problem 1 is thus solved in $O(n + |\mathcal{L}|)$ expected time using
$O(|\mathcal{F}| \log n)$ bits of space, but the algorithm is incorrect with an extremely small
probability that can be bounded in advance.

1 Introduction 

We consider a finite ordered alphabet $\Sigma$ with $\sigma = |\Sigma|$ and $s = s_1..s_n$ a sequence of $n$
letters, $s_i \in \Sigma$. The set of all sequences over $\Sigma$ is denoted $\Sigma^*$. The rank of each letter
$\alpha$ in $\Sigma$ is given by $f_\Sigma(\alpha)$ that ranges between $0$ and $\sigma - 1$. A sequence $v \in \Sigma^*$ is a
factor or substring of $s$ if $s = uvw$. The fingerprint $C(s)$ of a sequence $s$ is the set of
distinct letters in $s$. By extension, $C_s(i,j)$ is the set of distinct letters in $s_i..s_j$.

* This work is supported by the Russian Foundation for Fundamental Research (Grant
05-01-00994) and the program of the President of the Russian Federation for
supporting young researchers (Grant MD-3635.2005.1).
** This work is also supported by the French ANR project MAPPI.



Definition 1. Let $C$ be a set of letters of $\Sigma$. A maximal location of $C$ in $s = s_1..s_n$ is
an interval $[i,j]$, $1 \le i \le j \le n$, such that

(1) $C_s(i,j) = C$; (2) if $i > 1$, $s_{i-1} \notin C_s(i,j)$; (3) if $j < n$, $s_{j+1} \notin C_s(i,j)$.

This maximal location is denoted $\langle i,j \rangle$.

We denote by $\mathcal{F}$ the set of distinct fingerprints and by $\mathcal{L}$ the set of maximal locations
of all fingerprints of $\mathcal{F}$.

Definition 2. Two maximal locations $\langle i,j \rangle$ and $\langle k,l \rangle$ of $s = s_1..s_n$ are copies if
$s_i..s_j = s_k..s_l$.

The "copy" relation is obviously an equivalence relation over $\mathcal{L}$, and we denote by $\mathcal{L}_C$ the
set of equivalence classes. In this paper, given a sequence $s$, we are interested in three
main problems:

1. Compute the set $\mathcal{F}$ of all fingerprints in $s$;

2. Given a fingerprint $f$, find whether $f$ is a fingerprint in $\mathcal{F}$;

3. Given a fingerprint $f$, find all the maximal locations of $f$ in $s$.
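
To make the definitions concrete, the following quadratic-time sketch enumerates every maximal location directly from Definition 1. It is only a brute-force reference for small inputs, not one of the algorithms of this paper; the function name `maximal_locations` is illustrative.

```python
def maximal_locations(s):
    """Map each fingerprint (a frozenset of letters) to its maximal locations.

    Intervals are 1-based and inclusive, as in Definition 1.
    Brute force: O(n^2) interval checks.
    """
    n = len(s)
    result = {}
    for i in range(1, n + 1):
        seen = set()                      # becomes C_s(i, j) as j grows
        for j in range(i, n + 1):
            seen.add(s[j - 1])
            left_ok = i == 1 or s[i - 2] not in seen   # s_{i-1} not in C_s(i, j)
            right_ok = j == n or s[j] not in seen      # s_{j+1} not in C_s(i, j)
            if left_ok and right_ok:
                result.setdefault(frozenset(seen), []).append((i, j))
    return result
```

Every fingerprint has at least one maximal location, so the keys of the returned dictionary form the set of all fingerprints (Problem 1), a key lookup answers Problem 2, and the stored lists answer Problem 3.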

Efficient answers to these questions have many applications in information retrieval,
computational biology and natural language processing [1]. The input alphabet $\Sigma$ is
considered to be the alphabet of the input sequence, thus $\sigma \le n$. Notice that $|\mathcal{L}| \le n\sigma$.
The best current algorithms solve Problem 1 in $\Theta(\min\{n + |\mathcal{L}| \log\sigma, n^2\})$ time and
space. The bound $O(n + |\mathcal{L}| \log\sigma)$ is that of [15]. The $\Theta(n^2)$ bound is obtained using the
algorithm of Didier et al. [8]. Problem 2 is solved in $O(|f| \log(\sigma/|f|))$ time and $O(|\mathcal{F}|)$
space ($O(|\mathcal{F}| \log n)$ bits) and Problem 3 in $O(|f| \log(\sigma/|f|) + occ)$ time (where $occ$ is
the number of maximal locations that match the given fingerprint) and $O(|\mathcal{F}| + |\mathcal{L}|)$
space ($O((|\mathcal{F}| + |\mathcal{L}|) \log n)$ bits) in [15].

We first present new exact efficient algorithms and data structures for the three
problems we considered above.

Problem 1 is solved either in $O(n + |\mathcal{L}_C| \log\sigma)$ worst-case time using $O((n + |\mathcal{L}_C| +
|\mathcal{F}| \log\sigma) \log n)$ bits of space, or in $O(n + |\mathcal{L}| \log\sigma)$ randomized expected time using
$O((n + |\mathcal{F}| \log\sigma) \log n)$ bits of space.

Problem 2 is solved either in $O(|f|)$ expected time if only $O(|f| \log n)$
bits of working space for queries is allowed, or in $O(|f|/\epsilon)$ worst-case time if a working
space of $O(\sigma^\epsilon \log n)$ bits is allowed. These algorithms use a data structure which occupies
$|\mathcal{F}|(2\log\sigma + \log_2 e)(1 + o(1))$ bits. Previous and new exact results are summarized in
Table 2.

Problem 3 is solved in the same time as Problem 2, with the addition of an $occ$ term
to each of the complexities, where $occ$ is the number of maximal locations corresponding
to the fingerprint searched. Previous and new exact results are summarized in Tables
1-3.



Solution | Build space (bits) | Build time
prev. [15] (worst-case) | $O((n + |\mathcal{L}|) \log n)$ | $O(n + |\mathcal{L}| \log\sigma)$
Theorems 2, 3 (worst-case) | $O((n + |\mathcal{L}_C| + |\mathcal{F}| \log\sigma) \log n)$ | $O(n + |\mathcal{L}_C| \log\sigma)$
Theorem (randomized expected) | $O((n + |\mathcal{F}| \log\sigma) \log n)$ | $O(n + |\mathcal{L}| \log\sigma)$
Theorem (Monte-Carlo) | $O((n + |\mathcal{F}|) \log n)$ | $O(n + |\mathcal{L}|)$

Table 1. Previous and new solutions to Problem 1 (Determination of $\mathcal{F}$).



Solution | Data structure space (bits) | Query time
prev. | $O(|\mathcal{F}| \log n)$ | $O(|f| \log(\sigma/|f|))$
Theorem 4 | $O(|\mathcal{F}| \log\sigma)$ | $O(|f|)$

Table 2. Previous and new solution for Problem 2 (existential fingerprint queries).



Solution | Data structure space (bits) | Query time
prev. | $O(|\mathcal{L}| \log n)$ | $O(|f| \log(\sigma/|f|) + occ)$
Theorem 5 | $O(|\mathcal{L}| \log n)$ | $O(|f| + occ)$
Theorem 5 | $O((n + |\mathcal{L}_C|) \log n)$ | $O(|f| + occ)$

Table 3. Previous and new solutions to Problem 3 (maximal location report queries).



In this article we also propose a novel Monte Carlo approximate query approach. The
result of the query may not be exact, but an error occurs with a probability that one
can fix a priori to be as small as required. This approach has the advantage of speeding
up the identification of all fingerprints by a $\log\sigma$ factor. Problem 1 is thus solved in
$O(n + |\mathcal{L}|)$ expected time using $O(|\mathcal{F}| \log n)$ bits of space using a Monte Carlo approach,
but the algorithm yields incorrect results with an extremely low probability. Table 1
summarizes the complexities of the construction space and time including the
Monte-Carlo method.

Our algorithms are based on several tools of four main natures: hash functions,
succinct data structures, trees, and naming techniques first introduced in [12], adapted
to the fingerprint problem in [1] and then successively improved in [6] and in [15]. These
tools are presented in Section 2. In Section 3 we present our $O(n + |\mathcal{L}_C| \log\sigma)$ worst-case
time construction algorithm. Section 4 presents a more space-efficient representation of
$\mathcal{F}$ in $O(|\mathcal{F}| \log\sigma)$ bits instead of $O(|\mathcal{F}| \log n)$ bits. This data structure allows us
to solve Problems 2 and 3 within the complexity bounds announced above. Then Section 5
contains the $O(n + |\mathcal{L}| \log\sigma)$ expected time algorithm using $O((n + |\mathcal{F}| \log\sigma) \log n)$ bits
of space for solving Problem 1. Finally, in Section 6 we present the Monte Carlo algorithm
that solves Problem 1 in time $O(|\mathcal{L}|)$ and space $O(|\mathcal{F}| \log n)$, thus
saving a $\log\sigma$ factor in both the space and time complexity of the algorithm in Section 5.

We assume below without loss of generality that the input sequence does not contain
two consecutive repeating characters. Such a sequence is named simple. Each segment
of repeating characters, say $\alpha$, of any input sequence can be reduced to a unique
occurrence of $\alpha$. The two sequences have the same set $\mathcal{F}$ and the same sets $\mathcal{L}$ and $\mathcal{L}_C$,
up to small changes in the bounds (these changes can be simply retrieved in $O(1)$ time
per maximal location and produced by a trivial algorithm in $\Theta(n)$ time). This technical
trick greatly simplifies the algorithms we present by removing many straightforward
technical cases.
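
As an illustration of this reduction, the sketch below collapses runs of equal characters and remembers where each run sits in the original sequence, so that bounds found on the simple sequence can be mapped back in constant time. The function name `to_simple` and the returned layout are assumptions made for the example.

```python
from itertools import groupby


def to_simple(s):
    """Collapse each run of a repeated character to one occurrence.

    Returns the simple sequence together with, for every kept character,
    the 1-based (start, end) of its run in the original sequence.
    """
    simple, runs, pos = [], [], 1
    for ch, group in groupby(s):
        length = sum(1 for _ in group)
        simple.append(ch)
        runs.append((pos, pos + length - 1))
        pos += length
    return "".join(simple), runs
```

A maximal location (i, j) of the simple sequence then corresponds to the interval from `runs[i - 1][0]` to `runs[j - 1][1]` in the original sequence.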

All the algorithms presented in this paper assume the unit-cost word RAM model
with word length $w = \Omega(\log n)$ and with usual arithmetic and logic operations taking
constant time (additions, multiplications, bitwise operations, etc.).

2 Tools 

This section is devoted to the four main tools we use in our algorithms, namely
polynomial hash functions, the suffix tree, the participation tree and the naming technique.



2.1 Hash functions 



Our constructions are based on the use of polynomial hash functions modulo $P$, where
$P$ is a suitably chosen prime. Given a collection $M$ of $m$ sets over a universe of size $\sigma$, our
goal is to find a polynomial hash function so that each set is mapped to a distinct
value. The polynomials are evaluated modulo an arbitrary prime $P$ chosen such that
$m^2\sigma < P < 2m^2\sigma$ (we will show later how to efficiently find such a prime). More
precisely, we will use a family of hash functions $H_P = \{h_X \mid X \in [1, P-1]\}$, where each
hash function $h_X \in H_P$ in the family is parametrized with an integer $X \in [1, P-1]$.
The functions of the family are defined in the following way: for any set $S$ of $t$ distinct
integers $S = \{e_1, e_2, \ldots, e_t\}$ such that $S \subseteq [0, \sigma-1]$ we have:

$$h_X(S) = \sum_{i=1}^{t} X^{e_i} \bmod P$$

In order to compute a fixed hash function $h_X$ on any set $S$ in $O(|S|)$ time, we can use a
precomputed table of size $\sigma$, which stores all the powers of $X$ up to $X^{\sigma-1}$. Alternatively,
we could use a two-dimensional precomputed table $T$ of size $c \cdot \lceil \sigma^{1/c} \rceil$ for any integer
$c$, ensuring a computation time of $O(c|S|)$. That is, we store in $T[i,j]$ the number $X^{i\gamma^j}$
where $\gamma = \lceil \sigma^{1/c} \rceil$. Then in order to compute $X^{e_i}$, we can use the property that $e_i$ can
be decomposed into a sum of $c$ numbers:

$$e_i = \sum_{j=0}^{c-1} d_{ij}\gamma^j$$

where each $d_{ij}$ can be computed using the formula:

$$d_{ij} = \lfloor e_i/\gamma^j \rfloor \bmod \gamma$$

Thus for computing $X^{e_i}$, it suffices to use the formula:

$$X^{e_i} = \prod_{j=0}^{c-1} X^{d_{ij}\gamma^j} = \prod_{j=0}^{c-1} T[d_{ij}, j]$$

To summarize, given any set $S = \{e_1, e_2, \ldots, e_t\}$ where $S \subseteq [0, \sigma-1]$, $h_X(S)$ can
be computed in $O(c \cdot t)$ time. First, for each $e_i$, compute its decomposition
$\sum_{0 \le j < c} d_{ij}\gamma^j$ in $O(c)$ time, where each $d_{ij}$ is computed
by $d_{ij} = \lfloor e_i/\gamma^j \rfloor \bmod \gamma$, and then compute $X^{e_i}$ also in $O(c)$ time using the formula
$X^{e_i} = \prod_{0 \le j < c} T[d_{ij}, j]$. Thus, the computations of all $X^{e_i}$ take $O(c \cdot t)$ time in total.
The final step is to sum all of the computed $X^{e_i}$, which takes time $O(t)$.
The space needed by the precomputed table $T$ is $O(c \cdot \sigma^{1/c})$.
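
The table-based evaluation can be sketched as follows. This is a minimal illustration under the assumption that Python integers stand in for machine words; the names `build_power_table` and `poly_hash` are hypothetical and do not come from the paper.

```python
def build_power_table(X, P, sigma, c):
    """Return (T, gamma) with T[d][j] = X^(d * gamma^j) mod P.

    gamma is the smallest integer with gamma^c >= sigma, i.e. ceil(sigma^(1/c)),
    so every e in [0, sigma - 1] has at most c digits in base gamma.
    """
    gamma = 1
    while gamma ** c < sigma:
        gamma += 1
    table = [[pow(X, d * gamma ** j, P) for j in range(c)] for d in range(gamma)]
    return table, gamma


def poly_hash(S, table, gamma, P):
    """Evaluate h_X(S) = sum of X^e mod P over e in S, using O(c) work per element."""
    c = len(table[0])
    total = 0
    for e in S:
        power, rest = 1, e
        for j in range(c):
            rest, d = divmod(rest, gamma)      # d is the j-th base-gamma digit of e
            power = power * table[d][j] % P    # multiply by X^(d * gamma^j)
        total = (total + power) % P
    return total
```

With $c = 1$ the table degenerates to the plain table of the $\sigma$ powers of $X$; larger $c$ trades a factor $c$ in time for a smaller table.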
In the following we will need the technical lemma below: 

Lemma 1. Given a collection $M$ of $m$ integer sets where each set is a subset of $[0, \sigma-1]$,
a randomly chosen hash function $h_X \in H_P$ for $P > m^2\sigma$ will injectively map the
collection $M$ to the interval $[0, P-1]$ with probability at least $1/2$.

Proof. Take any pair of distinct sets $(x,y) \in M^2$. The two sets
$x$ and $y$ are mapped to the same hash value by a function $h_X \in H_P$ if and only if
$h_X(x) - h_X(y) = 0$. Now $h_X(x) - h_X(y)$ is a nonzero polynomial in $X$ of degree at most $\sigma - 1$ over
the field $GF[P]$, which consequently can have at most $\sigma - 1$ roots. Therefore for any
pair $(x,y) \in M^2$ we have that $h_X(x) - h_X(y)$ can possibly be zero for at most $\sigma - 1$
different values of $X$. As we have $m(m-1)/2$ such pairs, the number of values of $X$
for which we have a collision for any of the pairs is at most $t = (\sigma-1)m(m-1)/2$.
We have $P > \sigma m^2$ and therefore $t < P/2$. □
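
Lemma 1 suggests a simple retry loop: draw $X$ at random and keep it if no two sets collide, which takes at most two draws in expectation. The sketch below assumes that the sets of the collection are pairwise distinct and that $P$ is a prime larger than $m^2\sigma$; the function name `pick_injective_x` is illustrative.

```python
import random


def pick_injective_x(collection, P, rng=random):
    """Draw X uniformly in [1, P - 1] until h_X is injective on `collection`.

    Assumes the sets in `collection` are pairwise distinct and P is a prime
    with P > m^2 * sigma, so each draw succeeds with probability >= 1/2.
    """
    while True:
        X = rng.randrange(1, P)
        hashes = {sum(pow(X, e, P) for e in S) % P for S in collection}
        if len(hashes) == len(collection):   # no two sets share a hash value
            return X
```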

We now sketch how to efficiently find one prime number in the interval $[m^2\sigma, 2m^2\sigma]$.
By well known properties of the distribution of prime numbers, we know that the
density of primes below a given number $N$ is roughly $1/\log N$. This suggests
the following simple algorithm: randomly pick a number $P$ in the interval $[m^2\sigma, 2m^2\sigma]$.
The number $P$ will be prime with probability $\Omega(1/\log(m^2\sigma)) = \Omega(1/(\log m + \log\sigma))$.
Then test whether $P$ is a prime using any efficient deterministic primality testing
algorithm that takes time polylogarithmic in $P$. If $P$ is not a prime, then repeat the
same procedure (pick a random $P$ in the interval and test its primality) until we
get a prime $P$. Because the probability of $P$ being prime is $\Omega(1/(\log m + \log\sigma))$, the
expected number of repetitions will be $O(\log m + \log\sigma)$. As a primality test
takes time polylogarithmic in $m^2\sigma$ and we are doing $O(\log m + \log\sigma)$ expected
primality tests, we deduce that the total time for finding $P$ is $O((\log m + \log\sigma)^c)$ for
some constant $c$.
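
The sampling loop can be sketched as below. The paper only requires some deterministic polylogarithmic primality test; as a stand-in assumption, this example uses Miller-Rabin with the first twelve primes as bases, which is known to be deterministic for $n < 3.3 \cdot 10^{24}$. The names `is_prime` and `random_prime` are illustrative.

```python
import random

_BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_prime(n):
    """Miller-Rabin with fixed bases; deterministic for n < 3.3 * 10**24."""
    if n < 2:
        return False
    for p in _BASES:
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:          # write n - 1 = d * 2^r with d odd
        d //= 2
        r += 1
    for a in _BASES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False       # a witnesses that n is composite
    return True


def random_prime(m, sigma, rng=random):
    """Sample integers in (m^2*sigma, 2*m^2*sigma] until one is prime."""
    lo, hi = m * m * sigma, 2 * m * m * sigma
    while True:
        P = rng.randint(lo + 1, hi)   # strictly above m^2*sigma, as Lemma 1 requires
        if is_prime(P):
            return P
```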

2.2 Succinct Data Structures 

Succinct Static Function Representation We will make use of the following
recent result:

Lemma 2. [16] Given a set $S \subseteq U$ where $|U| \le 2^w$, $|S| > \log|U|$ and a function $f$ from
$S$ into $[0, 2^k - 1]$ (with $k \le w$), we can, in $O(|S|)$ time, build a succinct representation of
the function $f$ that uses $|S|k(1 + o(1))$ bits. Given any element $x \in S$ the representation
returns $f(x)$ in constant time. Given an element $x \in U \setminus S$, the representation returns
an arbitrary value in $[0, 2^k - 1]$ in constant time.

The result stated in the lemma was first described in [16]. It combines the use of a
set of hash functions with matrix solving on $GF[2^k]$ (two similar methods are also
described in [5, 19] but have slightly worse performance). The lemma says that we can
have a representation of a function $f$ from $S \subseteq U = [0, 2^w - 1]$ into $[0, 2^k - 1]$ that
returns the correct value for $f(x)$ when queried for an element $x \in S$, but
returns an arbitrary value for any element $x$ outside $S$. Therefore, the representation
is unable to detect whether a given element $x$ is in $S$ or not. This is why the space
usage in the lemma has no dependence on $U$, but instead only depends on $k$ and on
the cardinality of $S$ (it is easy to see that in order to detect whether $x \in S$ we need to
store $S$ in one way or another and thus need to use a space of at least $\Omega(|S| \log|U|)$
bits).

Succinctly Encoded Tries (Cardinal trees) A trie (or cardinal tree) is a tree
where each edge has a label from the alphabet $\Sigma$. The maximal degree in a trie is thus
$\sigma = |\Sigma|$. A standard representation of a trie of $N$ nodes would need $O(N \log N)$ bits
(essentially the $\log N$ bits are needed to encode pointers in the trie). In our case we
need a succinct representation that uses less than $O(N \log N)$ bits, ideally close to the
information theoretic lower bound which is about $N \log\sigma + O(N)$ bits. We will thus
use the following result described in [17]:



Fig. 1. Suffix tree of $s = a_1 b_2 a_3 c_4 e_5 a_6 b_7 a_8 c_9 d_{10} \#_{11}$. Square boxes contain the
initial position of the suffix. Each edge is labeled by a pair $[k,l]$ pointing to $s_k..s_l$ that
we explicitly write on the edge for clarity.

Lemma 3. Given a trie (cardinal tree) having a total of $N$ nodes over an alphabet of
size $\sigma \ge 2$, we can build a representation that uses $N(\log\sigma + \log_2 e + o(1))$ bits of space
and supports basic navigation operations in constant time. In particular it supports the
following operation in constant time: given a node $p$ having identifier $i_p$ and a character
$\alpha$, tell whether $p$ has a child $d$ labeled with character $\alpha$ and return its identifier $i_d$.

The operation cited in the lemma is the only one which will be used in this paper.

2.3 Trees

Suffix Tree The suffix tree $ST(s)$ is a compact representation of all suffixes of a given
sequence $s = s_1 \ldots s_n$. It is basically a trie of all suffixes of $s$ where all the nodes with
a single child are merged with their parents. Each transition of the tree is then coded
as an interval $[i,j]$ corresponding to $s_i..s_j$. Its size is $O(n)$ and it can be built in $O(n)$
time even on an integer alphabet using the construction algorithm of [10]. An example of
such a suffix tree is given in Figure 1.

We assume below that in the suffix tree each transition interval $[i,j]$ of $ST(s)$
corresponds to the leftmost occurrence of the factor $s_i \ldots s_j$ in $s$. For instance, in
Figure 1 the transition from 1 to 2 is the pointer $[1,1] = s_1 = a$. This property is
ensured by Ukkonen's algorithm [18], but can also be ensured on every suffix tree by
a simple additional $O(n)$ steps.

Fingerprint Trie We now present the fingerprint trie (this is called backtracking
tree in [3, 14]). The fingerprint trie is a tree representation of the fingerprints. The trie
representation exploits the property that for every $f \in \mathcal{F}$ such that $|f| \ge 2$ there exists
necessarily at least one other fingerprint $g \in \mathcal{F}$ and some letter $\alpha$ such that $g \cup \{\alpha\} = f$.
In other words, for every $f \in \mathcal{F}$ there exists some $g \in \mathcal{F}$ such that $f$ can be written
as a sequence $\beta_0..\beta_j, \alpha$ (of distinct characters) and $g \in \mathcal{F}$ written as a sequence $\beta_0..\beta_j$.
This property means that the set of fingerprints can be represented as a trie. More
precisely, let $\mathcal{F}_i \subseteq \mathcal{F}$ be the subset of the fingerprints of $\mathcal{F}$ where each $f \in \mathcal{F}_i$ is of size
$i$. At the beginning, we start with a trie which contains only a root. Then we take the
subset $\mathcal{F}_1$ of all fingerprints in $\mathcal{F}$ consisting of one character. Then for each fingerprint
$f \in \mathcal{F}_1$ consisting of a character $\alpha$, we create a new node corresponding to $f$ and attach
it as a child of the root with a link labeled with the character $\alpha$. Then the remainder
of the trie can be built level-by-level: for building level $i \ge 2$, we consider the set $\mathcal{F}_i$
and for each $f \in \mathcal{F}_i$ do the following:

1. First consider a fingerprint $g \in \mathcal{F}_{i-1}$ (represented by a node $q_g$) and a character $\alpha$
such that $g \cup \{\alpha\} = f$ (by the property above there exists at least one such pair
$(g, \alpha)$). If there exist several such pairs choose one arbitrarily.

2. Then create a new node $q_f$ and attach it as a child of the node $q_g$ (which
corresponds to $g$) with a link labeled with character $\alpha$.
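
The level-by-level construction can be sketched as follows, with plain dictionaries standing in for the succinct trie of Lemma 3 (an assumption made for readability). The name `build_fingerprint_trie` is illustrative, and the "arbitrary" choice among several pairs $(g, \alpha)$ is resolved here by taking the first one found.

```python
def build_fingerprint_trie(fingerprints):
    """Build a trie in which every fingerprint is the label set of one root path.

    Returns (children, node_of): children[q] maps a letter to a child id,
    node_of maps a fingerprint (frozenset) to its node id; node 0 is the root.
    """
    children = {0: {}}
    node_of = {frozenset(): 0}
    # Sorting by size processes F_1, then F_2, and so on.
    for f in sorted({frozenset(f) for f in fingerprints}, key=len):
        if not f:
            continue
        for a in sorted(f):
            g = f - {a}
            if g in node_of:              # found a pair (g, a) with g U {a} = f
                q = len(children)
                children[q] = {}
                children[node_of[g]][a] = q
                node_of[f] = q
                break
        else:
            raise ValueError(f"no parent fingerprint for {sorted(f)}")
    return children, node_of
```

The `ValueError` branch is never taken when the input really is the set of all fingerprints of a sequence, by the property stated above.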

2.4 Naming Technique 

The naming technique is used to give a unique name to each fingerprint from $\mathcal{F}$. We
assume for simplicity, but without loss of generality, that $\sigma$ is a power of two. We
consider a stack of $\log\sigma + 1$ arrays on top of each other. The levels are numbered from 1.
The lowest, called the fingerprint table, contains $\sigma$ names that are [0] or [1]. Each other
array contains half the number of names of the array it is placed on. The highest
array only contains a single name that will be the name of the whole array. Such a
name is called a fingerprint name. Figure 2 shows a simple example with $\sigma = 8$.



                [7]
        [5]             [6]
    [2]     [2]     [3]     [4]
  [1] [0] [1] [0] [1] [1] [0] [0]

Fig. 2. Naming example.



The names in the fingerprint table are only [0] or [1] and are given as input. Each
cell $c$ of an upper array represents two cells of the array it is placed on, and thus a pair
of two names. The naming is done in the following way: for each level going from the
lowest to the highest, if the cell represents a new pair of names, give this pair a new
name and assign it to the cell. If the pair has already been named, place this name
into the cell. In the example in Figure 2 the name [2] is associated to ([1], [0]) the first
time this pair is encountered. The second time, this name is directly retrieved.
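
The naming of a single fingerprint table can be sketched as below. The dictionary `pair_names` is an assumption of this example: it plays the role of the memory of already named pairs and may be shared between calls, so that equal tables receive equal fingerprint names.

```python
def name_fingerprint_table(table, pair_names):
    """Name a 0/1 fingerprint table (length a power of two) level by level.

    pair_names maps a pair of names to its name and is updated in place.
    Names 0 and 1 are reserved for the bottom level; new names start at 2.
    Returns all levels, bottom first; the last level holds the fingerprint name.
    """
    levels = [list(table)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        row = []
        for c in range(0, len(below), 2):
            pair = (below[c], below[c + 1])
            if pair not in pair_names:                  # new pair: give it a new name
                pair_names[pair] = len(pair_names) + 2
            row.append(pair_names[pair])
        levels.append(row)
    return levels
```

On the table of Figure 2, `name_fingerprint_table([1, 0, 1, 0, 1, 1, 0, 0], {})` produces the rows `[2, 2, 3, 4]`, `[5, 6]` and `[7]`.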

Naming a List of Fingerprint Changes. Assume that a specific set $S$ of fingerprints
can be represented as a list $L = (\alpha_1, \alpha_2, \ldots, \alpha_p)$ of distinct characters such that $S =
\{f_1, f_2, \ldots, f_p\}$ where $f_i = \bigcup_{1 \le j \le i}\{\alpha_j\}$.

The core idea of the algorithm of [6] is to fill a fingerprint table bottom-up by
building for each level an ordered list of new names that corresponds to the fingerprint
changes induced at the previous level. A pseudo-code of this naming algorithm is given
in Figure 3. We explain it below.

We number the levels from 1, the lowest, to $\log\sigma + 1$. The original list $L$ is first
transformed into a list $L_1$ of changes on level 1 by replacing each character $\alpha_i$ by the
pair $\{[1], f_\Sigma(\alpha_i)\}$. To initialize the process we add a list of $\sigma$ pairs $\{[0], i\}$, $i = 0..\sigma-1$,
at the beginning of $L_1$.



NAME_LISTS($L = (\alpha_1, \alpha_2, \ldots, \alpha_p)$ initial list of changes)
1.  $L_1$ ← ({[0], 0}, ..., {[0], $\sigma - 1$})
2.  add ({[1], $f_\Sigma(\alpha_1)$}, ..., {[1], $f_\Sigma(\alpha_p)$}) to end of $L_1$
3.  For $r = 1..\log\sigma$ Do
4.      $FT_r$ ← name table of size $\sigma/2^{r-1}$
5.      $E_{tp}$ ← first element of $L_r$
6.      For $j = 0..\sigma/2^{r-1} - 1$ Do /* initialization of table FT */
7.          {[a], j} ← $E_{tp}$
8.          $FT_r[j]$ ← [a]
9.          $E_{tp}$ ← next element in $L_r$
10.     End of for
11.     Let $L'_r$ be an empty list
12.     $E_{tp}$ ← first element of $L_r$
13.     While $E_{tp}$ exists Do
14.         {[a], j} ← $E_{tp}$
15.         $FT_r[j]$ ← [a]
16.         add {($FT_r[2\lfloor j/2 \rfloor]$, $FT_r[2\lfloor j/2 \rfloor + 1]$), $\lfloor j/2 \rfloor$} to end of $L'_r$
17.         $E_{tp}$ ← next element in $L_r$
18.     End of while
19.     sort the pairs of names in $L'_r$ in lexicographical order
20.     give new names to each unique pair in $L'_r$
21.     build $L_{r+1}$ by copying $L'_r$ but replacing each pair by its new name
22. End of for

Fig. 3. Naming a list $L = (\alpha_1, \alpha_2, \ldots, \alpha_p)$ of fingerprint changes.



This initial list is then used to compute all names of the cells in the second level.
A table $FT$ of $\sigma$ names temporarily records the pairs of names to be coded. A list $L'_1$
of pairs of names is built as follows. The first $\sigma$ elements of $L_1$ are read to initialize
$FT$. The list $L'_1$ is initialized with $\sigma/2$ pairs built by reading $FT$. Then, the remainder
of the list $L_1$ is read and for each new element $\{[a], j\}$ (1) the table $FT$ is changed
in position $j$ by $FT[j] \leftarrow [a]$ and (2) the pair $\{(FT[2\lfloor j/2 \rfloor], FT[2\lfloor j/2 \rfloor + 1]), \lfloor j/2 \rfloor\}$ is
added to the end of $L'_1$. This means that in cell $\lfloor j/2 \rfloor$ of the second level a name has
to be given to the name pair $(FT[2\lfloor j/2 \rfloor], FT[2\lfloor j/2 \rfloor + 1])$.

At this point $L'_1$ records the list of changes to be made in the cells at level 2 and
the pairs of names that must receive a name. The pairs in this list are then sorted in
lexicographical order (through a radix sort) and a new name is assigned to each distinct
pair of names $(n_1, n_2)$. A new list $L_2$ is built from $L'_1$ (keeping the initial order of $L'_1$
and thus of $L_1$) by replacing each pair with its new name. For instance, if $\{([1], [0]), 1\}$
was in the list $L'_1$ and if the pair $([1], [0])$ received the new name [2], then $L_2$ now
contains $\{[2], 1\}$.

The list $L_2$ is the input at level 2 and the same process is repeated to obtain the
names in the third level, and so on. The last list $L_{\log\sigma + 1}$ contains the names of all the
fingerprints of $S$.

Complexity. The sum $\sigma + \sigma/2 + \sigma/4 + \ldots$ (lines 1 and 6-10 of the pseudo-code in Fig. 3)
for all cell initializations is bounded by $2\sigma$. The remaining construction of $L_1$ (line 2)
requires $\Theta(|L|)$ time. Then a linear sort of $\Theta(|L|)$ elements is performed for every level.
As there are $\log\sigma + 1$ levels, naming the list takes $O(\sigma + |L| \log\sigma)$ time.
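
The following sketch follows the prose description of the procedure: the table is initialized from the first entries of the level list, then only the remaining entries are replayed as changes. It is an illustration under simplifying assumptions: Python's comparison sort replaces the radix sort, each level uses its own name space starting at 0, and the name `name_lists` and its signature are hypothetical.

```python
def name_lists(changes, sigma):
    """Name the fingerprints f_0 = {} and f_i = {changes[0], ..., changes[i-1]}.

    changes holds letter ranks in [0, sigma - 1]; sigma must be a power of two.
    Returns one name per fingerprint; equal fingerprints get equal names.
    """
    # Level-1 list: sigma initial entries {[0], i}, then one {[1], rank} per change.
    level = [(0, i) for i in range(sigma)] + [(1, r) for r in changes]
    size = sigma
    while size > 1:
        table = [name for name, _ in level[:size]]           # initialise FT
        pairs = [((table[2 * c], table[2 * c + 1]), c) for c in range(size // 2)]
        for name, j in level[size:]:                         # replay the changes
            table[j] = name
            c = j // 2
            pairs.append(((table[2 * c], table[2 * c + 1]), c))
        # One new name per distinct pair (the paper sorts with a radix sort).
        new_name = {p: k for k, p in enumerate(sorted({p for p, _ in pairs}))}
        level = [(new_name[p], c) for p, c in pairs]
        size //= 2
    return [name for name, _ in level]
```

The first returned name belongs to the empty table and the $i$-th following one to $f_i$.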



3 Faster Fingerprint Computation 



Let $q \in \mathcal{L}_C$ and $\langle i,j \rangle$ be a maximal location of $q$, then we denote $st_s(q)$ as the string
$s_i..s_j$. Table 4 shows an example of a copy relation. Note that the number $|\mathcal{L}_C|$ can
be significantly less than $|\mathcal{L}|$. As an example, we can consider the word $w_k$ over the
alphabet $\Sigma_k = \{a_1, a_2, \ldots, a_k\}$ which is defined in the following inductive way: $w_1 = a_1$
and $w_k = w_{k-1}(a_1 a_2 \cdots a_k)^k$ for $k > 1$. For this word we have $|w_k| = \frac{1}{6}k(k+1)(2k+1)$,
$|\mathcal{L}| = \frac{1}{12}k(3k^3 + 2k^2 - 9k + 16) = \Theta(|w_k|^{4/3})$, and $|\mathcal{L}_C| = \frac{1}{6}k(k^2 + 5) = \Theta(|w_k|)$. Thus,
in this case $|\mathcal{L}_C| = o(|\mathcal{L}|)$ as $k \to \infty$.

Participation Tree Let $s = s_1..s_n$ be a simple sequence of characters over $\Sigma$. In
this first phase, for reasons that will become clear below, we add to the sequence a
last character $s_{n+1} = \#$ that does not appear in the sequence. Thus $s = s_1..s_n\#_{n+1}$.
Let $i$ and $j$ be positions in $s$, $1 \le i \le j \le n+1$. We define $\mathrm{fo}_s(i,j)$ as the string
formed by concatenating the first occurrences of each distinct character touched when
reading $s$ from position $i$ (included) to position $j$ (included). For instance, if $s =
a_1 b_2 a_3 c_4 e_5 a_6 b_7 a_8 c_9 d_{10} \#$, $\mathrm{fo}_s(3,9) = aceb$ and $\mathrm{fo}_s(5,10) = eabcd$.

Definition 3. Let $s = s_1..s_n s_{n+1}$ with $s_{n+1} = \#$ and let $1 \le i \le n$ be a position in $s$. Let
$j > i$ be the minimum position such that $s_j = s_i$ if it exists, $j = n+2$ otherwise. We
define $\mathrm{lfo}_s(i) = \mathrm{fo}_s(i, j-1)$.

For instance, if $s = a_1 b_2 c_3 a_4 d_5 a_6 b_7 a_8 c_9 b_{10} e_{11} \#_{12}$, $\mathrm{lfo}_s(1) = abc$ and $\mathrm{lfo}_s(5) = dabce\#$.
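
Both functions are easy to state in code. The sketch below uses 1-based inclusive positions as in the text and assumes that the string passed to `lfo` already ends with the terminator `#`; the function names are illustrative.

```python
def fo(s, i, j):
    """First occurrences of the distinct characters of s_i..s_j, in reading order."""
    return "".join(dict.fromkeys(s[i - 1:j]))   # dict keys keep insertion order


def lfo(s, i):
    """fo_s(i, j - 1), where j is the next occurrence of s_i after position i.

    s must include the final '#', so len(s) = n + 1 and the default j is n + 2.
    """
    nxt = s.find(s[i - 1], i)                   # 0-based index of the next occurrence
    j = nxt + 1 if nxt != -1 else len(s) + 1
    return fo(s, i, j - 1)
```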

The participation tree resembles a tree of all $\mathrm{lfo}_s(i)$ in which we removed terminal
characters (the need for this removal will appear clearly below). It contains the same
path labels. The participation tree allows some redundancy in the path labels, i.e. the
same path label might correspond to several paths from the root. Thus, our tree is
not always "deterministic" in the sense that a node can have several transitions by the
same character. We define it and build it from the suffix tree by cutting and shrinking
edges.

Let $s = s_1..s_n s_{n+1}$ where $s_{n+1} = \#$. The participation tree $PT(s)$ is built from
the suffix tree $ST(s)$ in the following way. Imagine the suffix tree in an "expanded"
version, that is, each edge $[i,j]$ is explicitly written by the corresponding factor $s_i..s_j$
(see Figure 1). Let us consider the sequence of characters on some path from the root



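Class q | Maximal locations | $st_s(q)$
1 | $a_1$ | $a_3$ | $a_6$ | $a_8$ | a
2 | $b_2$ | $b_7$ | b
3 | $c_4$ | $c_9$ | c
4 | $d_{10}$ | d
5 | $e_5$ | e
6 | $a_3c_4$ | $a_8c_9$ | ac
7 | $c_9d_{10}$ | cd
8 | $c_4e_5$ | ce
9 | $e_5a_6$ | ea
10 | $a_1b_2a_3$ | $a_6b_7a_8$ | aba
11 | $a_1b_2a_3c_4$ | $a_6b_7a_8c_9$ | abac
12 | $a_8c_9d_{10}$ | acd
13 | $a_3c_4e_5a_6$ | acea
14 | $e_5a_6b_7a_8$ | eaba
15 | $a_6b_7a_8c_9d_{10}$ | abacd
16 | $a_1b_2a_3c_4e_5a_6b_7a_8c_9$ | abaceabac
17 | $a_1b_2a_3c_4e_5a_6b_7a_8c_9d_{10}$ | abaceabacd

(Within a row, the copies of a class are separated by a bar; the last field is $st_s(q)$.)

Table 4. Copy relation example for $s = a_1 b_2 a_3 c_4 e_5 a_6 b_7 a_8 c_9 d_{10}$.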



Fig. 4. From suffix tree to the participation tree (right picture) of $s =
a_1 b_2 a_3 c_4 e_5 a_6 b_7 a_8 c_9 d_{10} \#_{11}$. New nodes are in gray. The $\epsilon$ transitions are removed in
the last step. Attached suffixes are shown in square boxes.



and let $\alpha$ be the first character on this path. Let $o$ be the second occurrence of $\alpha$ on
this path if it exists. We perform the following steps:

1. We first reduce all characters on this path after $o$ (included) to the empty string
$\epsilon$;

2. Then, on the section from the root to the character before $o$ we only keep the first
occurrence of each appearing character, i.e. the others are reduced to $\epsilon$;

3. We then replace the terminal character of each path from the root to a leaf by $\epsilon$;

4. We replace all multi-character edges by an equivalent series of single-character edges
and nodes. An example of such a resulting tree is shown in Figure 4 (left);

5. As a last step, all $\epsilon$ edges $(p, \epsilon, q)$ are removed by merging $p$ and $q$. The resulting
tree is the participation tree. An example of this last tree is shown in Figure 4
(right).

For each node $q$ of $ST(s)$ and $PT(s)$ we denote by $\mathrm{Suff}(q)$ the set of suffixes of $s$
that appear as leaves of the subtree rooted in $q$. We consider below that the suffixes
associated to a node in $ST(s)$ remain associated to the node in $PT(s)$, even after the
merging. This is shown in Figure 4: the suffixes in the square boxes associated to nodes
4 and 5 in the left picture are associated to node 2 in the participation tree (right
picture).

Lemma 4. Let $s = s_1..s_n$. For all $i = 1, \ldots, n$, each proper prefix of $\mathrm{lfo}_s(i)$ labels a
path from the root in $PT(s)$.

Proof. When nodes are ignored, the reduction of the path of a suffix $i$ in the suffix tree
corresponds to $\mathrm{lfo}_s(i)$ without its terminal character. □

Note that a proper prefix of $\mathrm{lfo}_s(i)$ might label several paths from the root in $PT(s)$.

Let $[i,j]$ be an interval on $s = s_1..s_n$ and let $\mathrm{Support}([i,j])$ be the minimal position
$p$, $i \le p \le j$, among the rightmost occurrences of each letter in $s_i \ldots s_j$. We define $O_s^{[i,j]}$ as
$\mathrm{fo}_s(\mathrm{Support}([i,j]), j)$. For instance, if $s = a_1 b_2 a_3 c_4 e_5 a_6 b_7 a_8 c_9 d_{10} \#_{11}$, $\mathrm{Support}([1,3]) =
2$, $\mathrm{Support}([4,10]) = 5$, $O_s^{[1,3]} = ba$ and $O_s^{[4,10]} = eabcd$.
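
A direct transcription of these two definitions, again with 1-based inclusive positions, could look as follows; the names `support` and `o_string` are illustrative.

```python
def support(s, i, j):
    """Smallest position among the rightmost occurrences of each letter of s_i..s_j."""
    last = {}
    for p in range(i, j + 1):
        last[s[p - 1]] = p            # ends up holding the rightmost occurrence
    return min(last.values())


def o_string(s, i, j):
    """O_s^[i,j]: first occurrences read from Support([i, j]) up to position j."""
    m = support(s, i, j)
    return "".join(dict.fromkeys(s[m - 1:j]))
```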



Definition 4. Let $s = s_1..s_n$ and $1 \le i \le j \le n$. We define $\mathrm{Extend}_s(i,j)$ as the
maximal location reached when extending the interval $[i,j]$ to the left and to the right
while the closest external characters $s_{i-1}$ or $s_{j+1}$ (if they exist) belong to $C_s(i,j)$.

For instance, if $s = a_1 b_2 a_3 c_4 e_5 a_6 b_7 a_8 c_9 d_{10} \#_{11}$, $\langle 1,4 \rangle = \mathrm{Extend}_s(2,4)$ and
$\langle 1,9 \rangle = \mathrm{Extend}_s(2,7)$.
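
The extension is a pair of simple scans, sketched below with 1-based inclusive positions; the function name `extend` is illustrative.

```python
def extend(s, i, j):
    """Grow [i, j] left and right while the neighbouring letter is in C_s(i, j)."""
    chars = set(s[i - 1:j])           # C_s(i, j); it does not change while extending
    n = len(s)
    while i > 1 and s[i - 2] in chars:
        i -= 1
    while j < n and s[j] in chars:
        j += 1
    return i, j
```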

Lemma 5. Let $\langle i,j \rangle$ be a maximal location of $s = s_1..s_n$. There exists a permutation
of all characters of $C_s(i,j)$ that labels a path from the root in $PT(s)$.

Proof. $O_s^{\langle i,j \rangle}$ is obviously a permutation of $C_s(i,j)$ and a proper prefix of $\mathrm{lfo}_s(\mathrm{Support}(\langle i,j \rangle))$,
which, by Lemma 4, labels a path from the root in $PT(s)$. □

Corollary 1. Let $s = s_1..s_n$. For all $i, j$, $1 \le i \le j \le n$, there exists a permutation of
all characters of $C_s(i,j)$ that labels a path from the root in $PT(s)$.

Proof. It suffices to extend the segment $s_i..s_j$ to $\langle k,l \rangle = \mathrm{Extend}_s(i,j)$ in which it is
contained. Then $C_s(i,j) = C_s(k,l)$ and Lemma 5 applies. □

Let $z = ((r, \alpha_1, p_1), \ldots, (p_{i-1}, \alpha_i, p_i))$ be a path in $PT(s = s_1..s_n)$ from its root $r$.
By notation extension, we denote $\mathrm{Suff}(z) = \mathrm{Suff}(p_i)$. Let $\mathrm{SPref}(s)$ be the set of all such
paths and $w(z) = \alpha_1\alpha_2..\alpha_i$. Let $\mathcal{P}(\mathcal{L})$ be the set of all sets of maximal locations. We
consider the function $\Phi$ formally defined as:

$$\Phi : \mathrm{SPref}(s) \to \mathcal{P}(\mathcal{L}), \qquad \Phi(z) = \{\langle k,l \rangle \in \mathcal{L} \mid \mathrm{Support}(\langle k,l \rangle) \in \mathrm{Suff}(z) \text{ and } O_s^{\langle k,l \rangle} = w(z)\}$$



Lemma 6. Let $z = ((r, \alpha_1, p_1), \ldots, (p_{i-1}, \alpha_i, p_i))$ be a non-empty path in $\mathrm{SPref}(s)$.
Then $\Phi(z) \ne \emptyset$.

Proof. By construction of the participation tree, there exists $m \in \mathrm{Suff}(z)$ such that
$\alpha_1 \ldots \alpha_i$ is a proper prefix of $\mathrm{lfo}(m)$. Let $p$ be the first position of $\alpha_i$ in $s$ following $m$.
Then $\bigcup_{1 \le j \le i}\{\alpha_j\} = C_s(m,p)$. Let $\langle k,l \rangle = \mathrm{Extend}_s(m,p)$.

We prove now that $\mathrm{Support}(\langle k,l \rangle) = m$. As $\alpha_1 \ldots \alpha_i$ is a proper prefix of $\mathrm{lfo}(m)$,
there exists an $\alpha = \mathrm{lfo}(m)_{i+1}$ such that there is no occurrence of $\alpha$ in the interval
$[m,p]$, and thus after the extension of $[m,p]$ to a maximal location $\langle k,l \rangle$, the index $l$
is strictly less than the index of the first occurrence of $\alpha$ after $m$. As, by definition
of $\mathrm{lfo}(m)$, there is no occurrence of $s_m$ before the index of $\alpha$ after $m$ in $s$, there is
no other occurrence of $s_m$ at the right of $s_m$ in the interval $[m,l]$. Moreover, since all
characters in $\alpha_1 \ldots \alpha_i$ and only them appear after $m$ in $[m,l]$ in the order of $\alpha_1 \ldots \alpha_i$
and the extension procedure ensures that all characters in $[k,m]$ are characters from
$\alpha_1 \ldots \alpha_i$, we have $\mathrm{Support}(\langle k,l \rangle) = m$.

Finally, it is obvious that $O_s^{\langle k,l \rangle} = O_s^{[m,p]} = \alpha_1..\alpha_i = w(z)$, and thus $\langle k,l \rangle \in \Phi(z)$. □

Lemma 7. Let $z_1, z_2 \in \mathrm{SPref}(s)$ be two distinct non-empty paths. Then $\Phi(z_1) \cap
\Phi(z_2) = \emptyset$.

Proof. Assume a contrario that there exists $\langle k,l \rangle \in \Phi(z_1) \cap \Phi(z_2)$. Let $m = \mathrm{Support}(\langle k,l \rangle)$,
$m \in \mathrm{Suff}(z_1)$ and $m \in \mathrm{Suff}(z_2)$. Thus one of the paths is a prefix of the other. As
$O_s^{\langle k,l \rangle} = w(z_1) = w(z_2)$, the two paths must be equal, which contradicts the hypothesis. □

Lemma 8. Let $\langle i,j \rangle$ and $\langle k,l \rangle$ be two distinct maximal locations of $s = s_1..s_n$ in the
same equivalence class of $\mathcal{L}_C$. Then there exists $z \in \mathrm{SPref}(s)$ such that both $\langle i,j \rangle$ and
$\langle k,l \rangle$ are contained in $\Phi(z)$.

Proof. Let mi = Support((i, j)) and m.2 = Support((fc, I)). As Si..Sj = Sk-.si, u — 
si and mi and mi are thus in the subtree of the path h labeled by 
u in ST(s). After reduction of this path in $PT(s)$, the resulting path $z$ is such that
$w(z) = O_s^{(i,j)} = O_s^{(k,l)}$, so $m_1, m_2 \in Suff(z)$. Thus $(i,j),(k,l) \in \Phi(z)$. □

Theorem 1. Any maximal location is contained in the image $\Phi(z)$ of some path $z$
in $PT(s = s_1..s_n)$, and the size of $PT(s)$ (without the initial positions of suffixes) is
$O(|\mathcal{L}_C|)$.

Proof. Lemma 5 directly implies that all maximal locations are in the image $\Phi(z)$ of
a path $z$ in $PT(s)$. As by Lemma 7 the images $\Phi(z)$ are non-overlapping, they form a
partition of $\mathcal{L}$. Lemma 8 ensures that the partition $\mathcal{L}_C$ is a subpartition of the partition
formed by the images of $\Phi$. As by Lemma 6 there is no empty image, the number of
such images is smaller than or equal to $|\mathcal{L}_C|$. □

Note that we considered the size of $PT(s = s_1..s_n)$ without the initial positions of
suffixes (square boxes in Figure Q). With these positions, the size of $PT(s)$ is
$O(n + |\mathcal{L}_C|)$.

We explain below how to compute the participation tree from the suffix tree in 
linear time. 

From Suffix Tree to Participation Tree We extend the notion of io 3 (i,j) 
keeping the positions of the characters in $s = s_1..s_n$. We define $efo_s(i)$ as the string
formed by concatenating the first occurrences of each distinct character touched when
reading $s$ from position $i$ (included) to position $n$ (included), each indexed by its
position in the sequence. For instance, if $s = a_1 b_2 a_3 c_4 e_5 a_6 b_7 a_8 c_9 d_{10} \#_{11}$, then
$efo_s(3) = a_3 c_4 e_5 b_7 d_{10} \#_{11}$ and $efo_s(5) = e_5 a_6 b_7 c_9 d_{10} \#_{11}$.
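The definition of $efo_s(i)$ can be checked on the example with a minimal Python sketch. The function name `efo` and the representation as a list of (character, position) pairs are illustrative choices, not taken from the paper, and this direct scan costs $O(n)$ per call, unlike the incremental maintenance described later.

```python
def efo(s, i):
    """Extended first occurrences of s read from 1-based position i.

    Returns a list of (character, position) pairs in reading order.
    """
    seen = set()
    result = []
    for pos in range(i, len(s) + 1):
        c = s[pos - 1]
        if c not in seen:          # first time c is touched from position i
            seen.add(c)
            result.append((c, pos))
    return result


s = "abaceabacd#"                  # a1 b2 a3 c4 e5 a6 b7 a8 c9 d10 #11
efo3 = efo(s, 3)                   # a3 c4 e5 b7 d10 #11
efo5 = efo(s, 5)                   # e5 a6 b7 c9 d10 #11
```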

The idea of the algorithm is the following. For each transition $(i,j)$ on the path of
a longest suffix $v = s_k \ldots s_n$, we compute the "participation" of the edge to $lfo_s(k)$,
that is, the new characters the edge brings in $lfo_s(k)$. For instance, in Figure [l] the
participation of edge $(6,8) = [5,11]$ is $e$, since it is on the path of the suffix $s_3 \ldots s_n$
and $lfo_s(3) = ace$. The participation of edge $(12,14) = [5,11]$ is $eab$ since $lfo_s(4) = ceab$.

To compute the participation of interval $[i,j]$ on the path of a suffix $v = s_k \ldots s_n$,
we use $efo_s(k)$ and also the next position of $s_k$ after $k$ in $s$, if it exists. Assume it is
the case and let $p$ be this position. Thus $s_p = s_k$. Let $efo_s(k) = s_k s_{l_1} s_{l_2} \ldots s_{l_z}$ and
$l_h < p \le l_{h+1}$. If $i > p$, the participation of $[i,j]$ is the empty word $\varepsilon$. Otherwise, if
$i < p$ then the participation of $[i,j]$ is the string (potentially empty) $s_{l_a} \ldots s_{l_b}$ with

- $i \le l_a$ and $l_a$ is the smallest such index;

- $l_b \le \min(j, p-1)$ and $l_b$ is the greatest such index.
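The two conditions on $l_a$ and $l_b$ amount to keeping the characters of $efo_s(k)$ whose positions fall in $[i, \min(j, p-1)]$. The sketch below is a direct, non-optimized Python rendering of that rule. The names `efo` and `participation` are illustrative, the interval $[i,j]$ is assumed to be already shifted to the suffix $k$, and taking $p = n+1$ when $s_k$ does not occur again is an assumption borrowed from the initialization $p_n = n+1$ of the construction algorithm.

```python
def efo(s, k):
    """First occurrences from 1-based position k, as (character, position) pairs."""
    seen, out = set(), []
    for pos in range(k, len(s) + 1):
        if s[pos - 1] not in seen:
            seen.add(s[pos - 1])
            out.append((s[pos - 1], pos))
    return out


def participation(s, k, i, j):
    """Participation of the (already shifted) interval [i, j] for suffix k."""
    n = len(s)
    # p: next position of s_k after k; n + 1 when there is none (assumption).
    p = next((q for q in range(k + 1, n + 1) if s[q - 1] == s[k - 1]), n + 1)
    if i >= p:
        return ""                  # the edge starts after the next s_k
    upper = min(j, p - 1)
    return "".join(c for c, pos in efo(s, k) if i <= pos <= upper)


s = "abaceabacd#"                  # a1 b2 a3 c4 e5 a6 b7 a8 c9 d10 #11
```

On the running example this reproduces the participations quoted in the text: `participation(s, 2, 2, 4)` gives `bac`, `participation(s, 3, 5, 11)` gives `e`, and `participation(s, 4, 5, 11)` gives `eab`.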

[Note that this computation requires that the interval [i,j] which annotates a tran- 
sition in the suffix tree corresponds to the suffix v used as reference. In order to ensure 
this, below we "shift" each interval [i,j] according to the suffix we are currently reading 
before computing its participation.] 

For instance, in Figure [l], $efo_s(2) = b_2 a_3 c_4 e_5 d_{10} \#_{11}$ and $p = 7$ since 7 is the next
position of $b$ after position 2. Thus, the participation of edge $(1,9) = [2,4]$ is
$b_2 a_3 c_4 = bac$, and the participation of $(9,11) = [5,11]$ is $e_5 = e$ (since $p = 7$).

Build_part_tree($ST(s = s_1..s_n s_{n+1})$ with $s_{n+1} = \#$)

 1. $efo_s(n) = s_n$ and $p_n = n + 1$
 2. For $i = n..1$ Do
 3.     length ← n
 4.     Current ← Leaf(i) in ST(s)
 5.     While Current not marked AND Current ≠ Root Do
 6.         Prec ← Parent(Current) in ST(s)
 7.         [k, l] ← edge(Prec, Current)
 8.         [pos_deb, pos_end] ← [length − (l − k), length]
 9.         Compute the participation of [pos_deb, pos_end] in $efo_s(i)$
10.         Mark Current; Current ← Prec
11.         length ← length − (l − k) − 1
12.     End of while
13.     $efo_s(i-1)$ ← Update $efo_s(i)$
14. End of for
15. Replace each terminal character of all paths from the root by $\varepsilon$.
16. Remove $\varepsilon$ edges by node merging.

Fig. 5. Building the participation tree from the suffix tree.

For each suffix $[k,n]$, given $efo_s(k)$ and $p$, a bottom-up process from leaf $k$ to the
root of the suffix tree allows us to:

(a) shift the pointed positions to positions corresponding to the suffix considered. The
    bottom-up approach allows us to read the suffix from its end, and thus the sizes of
    the encountered transitions are enough to know which segment of the suffix the
    edge represents;

(b) compute the participation of each (not previously touched) edge on this path.
    Also, the bottom-up approach allows us to avoid unnecessary computation, since
    the participation of an upper edge ends in $efo_s(k)$ where the participation of the
    lower edge begins.

We modify the suffix tree using successive $efo_s(k)$, for $k = n..1$. A sketch of this
algorithm is given in Figure [5]. At the end of this process, we first replace the terminal
character of all paths from the root by $\varepsilon$. We finally remove all $(u, \varepsilon, v)$ edges by merging
$u$ and $v$.

Theorem 2. The participation tree of $s = s_1..s_n$ can be built in $O(n + |\mathcal{L}_C|)$ time and
$O((n + |\mathcal{L}_C|) \log n)$ bits of space.

Proof. The algorithm is correct since it consists of the direct computation of the par- 
ticipation of each edge one after the other. We now study its complexity. 

For each suffix $[k,n]$, given $efo_s(k)$ and $p$, the bottom-up process from leaf $k$ to the
root of the suffix tree can be done in $O(1)$ time for each unmarked node.

We maintain each $efo_s(i)$ as a combination of a doubly linked list and an array of
size $\sigma$ in which each cell $j$ points to the position of character $f_\Sigma^{-1}(j)$ in the doubly linked
list. Thus, adding a character $c$ to the head of the doubly linked list while recording
its position in the corresponding cell of the array is $O(1)$. Removing a character from
the list is also $O(1)$ since it suffices to find its position in the list using the array and
remove the character using the pointers to the previous and next characters in the list.
Initializing the structure is $O(\sigma)$ but it has only to be done once. In addition to the
array and the doubly linked list, a pointer $tp$ points to the character in the list whose
position is just before $p$ (the next position of $s_i$ in $s$) if such a character exists, or to the
end of the list otherwise. An instance of this structure is given in Figure [6].




Fig. 6. Data structure for maintaining $efo(i)$ shown on $efo_s(2) = b_2 a_3 c_4 e_5 d_{10} \#_{11}$. The
pointer $tp$ points to the character in the list whose position is the largest position
in the list that is smaller than the next position $p$ of $b$ in $s$, which is 7.

Assume that $efo_s(i)$ is represented in this way, together with $tp_i$, the pointer associated
in the doubly linked list with the next position of the first character $s_i$ of $efo_s(i)$.
To compute $efo_s(i-1)$ and $tp_{i-1}$, it suffices to test in the array if $\alpha = s_{i-1}$ already
appears in the list. If yes, $tp_{i-1}$ points to the character just before $\alpha$ in the list; if not,
$tp_{i-1}$ is set to the end of the list. Then $\alpha$ is removed from the list and inserted at its
head. The first $efo_s(n)$ is simply $s_n$, and $tp$ points to the end of the list.
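The right-to-left update can be illustrated with a small Python sketch. It is only an approximation of the structure described above: Python dictionaries stand in for the array of size $\sigma$ and for the list pointers, the class name `EfoList` and its methods are hypothetical, and two conventions are assumptions made here: when $\alpha$ was already at the head, $tp$ is set to $\alpha$ itself, and $lfo_s(i)$ is read off as the part of the list from the head to $tp$, which agrees with the examples $lfo_s(3) = ace$ and $lfo_s(4) = ceab$.

```python
class EfoList:
    """Doubly linked list of characters plus per-character lookup tables."""

    def __init__(self):
        self.pos = {}      # character -> its position in s
        self.prev = {}     # character -> previous character in the list
        self.next = {}     # character -> next character in the list
        self.head = None
        self.tail = None
        self.tp = None     # last list element located before the next s_i

    def push(self, c, position):
        """Move (or insert) c at the head, i.e. go from efo(i) to efo(i-1)."""
        if c in self.pos:
            before, after = self.prev[c], self.next[c]
            if before is None:
                self.head = after
            else:
                self.next[before] = after
            if after is None:
                self.tail = before
            else:
                self.prev[after] = before
            tp = before                    # character just before c
        else:
            tp = self.tail                 # c never seen: end of the list
        self.prev[c], self.next[c] = None, self.head
        if self.head is not None:
            self.prev[self.head] = c
        else:
            self.tail = c
        self.head = c
        self.pos[c] = position
        self.tp = c if tp is None else tp

    def efo(self):
        out, c = [], self.head
        while c is not None:
            out.append((c, self.pos[c]))
            c = self.next[c]
        return out

    def lfo(self):
        out, c = [], self.head
        while True:
            out.append(c)
            if c == self.tp:
                return "".join(out)
            c = self.next[c]


s = "abaceabacd#"                          # a1 b2 a3 c4 e5 a6 b7 a8 c9 d10 #11
lst, snapshots = EfoList(), {}
for i in range(len(s), 0, -1):
    lst.push(s[i - 1], i)
    snapshots[i] = (lst.efo(), lst.lfo(), lst.tp)
```

Each call to `push` performs a constant number of pointer updates, which matches the $O(1)$ cost per position claimed in the proof. For $i = 2$ the snapshot holds $b_2 a_3 c_4 e_5 d_{10} \#_{11}$ with $tp$ on $e$, as in Figure 6.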

Computing the participation of each non-touched edge on a path from the root to a
leaf corresponding to suffix $i$ in a bottom-up manner is not expensive since it suffices to
"consume" $efo_s(i)$ backward from $tp_i$ edge after edge as soon as an edge $[k,l]$ (shifted
to correspond to suffix $i$) is such that $k$ is less than the position of the element pointed
to by $tp_i$. Thus, calculating the participation of each edge in the suffix tree can be done
in a time proportional to the participation of the edge in the $PT(s)$ tree plus the total
number of edges in the tree.

Replacing the terminal character of each path from the root by $\varepsilon$ is $O(n)$. Merging
each of the $\varepsilon$ edges can also be performed in $O(n)$ since each such $\varepsilon$ edge is either a
previous edge of the suffix tree or was labeled by a single terminal character of a path
from the root. The whole construction of $PT(s)$ is thus $O(n + |\mathcal{L}_C|)$ time.

The space required is the size of the suffix tree plus the size of the participation
tree plus the size of the data structure representing $efo_s(i)$, thus $O(n + |\mathcal{L}_C|)$ space. □

We now explain how to name all fingerprints from the participation tree. 



Naming a Participation Tree The naming approach of the previous section has 
been modified in [14] to name on the same set of names a table of lists of fingerprint 
changes. The main modification is that the linear sorting is done for each level on 
all the pairs of all the lists of the table. We use a similar approach, but instead of a 
table of lists we consider the set of all paths from the root in the participation tree 
PT(s). Each such path is considered as a list of fingerprint changes, except that the 
initialization of the naming list is done once for all paths. Corollary [l] guarantees our 
approach. The Name_fingerprint algorithm names all fingerprints. Its pseudo-code 
is given in Figure [7] 



Depth_first_search($FT_k$, Current)

 1. For all $\alpha$ such that $\delta(Current, \alpha) \neq \emptyset$ Do
 2.     $q \leftarrow \delta(Current, \alpha)$
 3.     $\{[a], j\} \leftarrow \Delta(Current, \alpha, q)$
 4.     prec ← $FT_k[j]$
 5.     $FT_k[j] \leftarrow [a]$
 6.     $\Delta(Current, \alpha, q) \leftarrow \{(FT_k[2\lfloor j/2 \rfloor], FT_k[2\lfloor j/2 \rfloor + 1]), \lfloor j/2 \rfloor\}$
 7.     Depth_first_search($FT_k$, $q$)
 8.     $FT_k[j]$ ← prec
 9. End of for

Name_fingerprint($PT(s)$)

10. $ninit_1 \leftarrow [0]$
11. For $k = 1..\log\sigma$ Do
12.     $FT_k$ ← name table of size $\sigma/2^{k-1}$, all initialized to $ninit_k$
13.     Depth_first_search($FT_k$, Root($PT(s)$))
14.     SL ← ∅ /* empty stack */
15.     For all edges $e = (p, \alpha, q)$ in $PT(s)$ Do
16.         $\{(n_1, n_2), j\} \leftarrow \Delta(p, \alpha, q)$
17.         Add $(n_1, n_2)$ to SL
18.     End of for
19.     add the couple $(ninit_k, ninit_k)$ to SL
20.     sort SL in lexicographical order
21.     give new names for each different couple in SL
22.     replace each pair in $\Delta(p, \alpha, q)$ by its new name
23.     $ninit_{k+1}$ ← name of the pair $(ninit_k, ninit_k)$
24. End of for

Fig. 7. Naming all fingerprints in a participation tree $PT(s)$.



As in the list naming of section 2.4, $\log\sigma$ iterations are performed, one per fingerprint
array level (loop 11-24), the lowest one excepted. With each edge $(p,\alpha,q)$ of $PT(s)$ a
value $\Delta(p,\alpha,q)$ is associated. At the end of iteration $k$, this value records the change
corresponding to the edge in the fingerprint array of level $k+1$. The value $\Delta(p,\alpha,q)$
is assumed to be initialized with $\{[1], f_\Sigma(\alpha)\}$, corresponding to the change induced by
the edge at the lowest level 1.

In each iteration $k$, the recursive algorithm Depth_first_search is called (line
13) on the participation tree to update all values $\Delta(p,\alpha,q)$ during a depth first search.
The update operation on each such value is similar to the pair update in the naming of
a simple list of fingerprint changes in section 2.4. Note that in Depth_first_search
a cell of the table $FT_k$ is modified (line 5) before the recursive call but restored to its
previous value after the call (line 8). This makes it possible to initialize the table $FT_k$ only
once before the first call to Depth_first_search (line 12), so the initialization costs
are the same for all paths as for a single list, and are thus bounded by $2\sigma$.

After the depth first search the values $\Delta(p,\alpha,q)$ are collected on all the edges
$(p,\alpha,q)$ of the participation tree (lines 14-18) in a list SL. This list is lexicographically
sorted and a new name is given to each unique pair (lines 20-21), similarly to the naming of
a single list in section 2.4. The initial pair of names of each $\Delta(p,\alpha,q)$ is then replaced
by its new name.
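The renaming of one level can be sketched in a few lines of Python. This is only an illustration of the "sort the pairs, then give one name per distinct pair" idea: the function name `name_pairs` is hypothetical, names are arbitrarily numbered from 1, and Python's comparison sort replaces the linear-time sorting on which the paper's complexity bound relies.

```python
def name_pairs(pairs):
    """Give one integer name to each distinct pair, in lexicographic order.

    Returns the list of names aligned with the input pairs, and the
    dictionary mapping each distinct pair to its name.
    """
    names = {}
    for pair in sorted(set(pairs)):
        names[pair] = len(names) + 1       # hypothetical numbering from 1
    return [names[p] for p in pairs], names


# Pairs collected on the edges, plus the initial pair (ninit_k, ninit_k) = (0, 0).
collected = [(0, 1), (0, 0), (0, 1), (2, 0)]
renamed, table = name_pairs(collected)
```

Equal pairs receive equal names, and the name of the pair built from the initial name can be read back from `table`, which corresponds to the role of $ninit_{k+1}$ at the next level.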



To initialize the fingerprint array at the next level, the couple $(ninit_k, ninit_k)$ is
added to the list of names (line 19) and its new name is retrieved after the sorting and
the renaming (line 23).

Theorem 3. The Name_fingerprint algorithm applied on $PT(s)$ names all fingerprints
of $s$ in $O(\sigma + |\mathcal{L}_C| \log\sigma)$ time using $O((|\mathcal{L}_C| + |\mathcal{F}| \log\sigma) \log n)$ bits of working
space.

4 A Space Efficient Fingerprint Representation 
4.1 Overview 

In this section we show how the fingerprint set can be represented in just
$|\mathcal{F}|(2\log\sigma + \log_2 e)(1 + o(1))$ bits of space instead of $O(|\mathcal{F}| \log n)$ bits. Our solution is particularly
attractive whenever $\sigma$ is sufficiently small (e.g. $\log\sigma = o(\log n)$) as it saves a factor
compared with a standard non-succinct representation that uses at least
$\Theta(|\mathcal{F}|)$ words of space, which translates into $\Theta(|\mathcal{F}| \log n)$ bits.

Our representation relies on the fingerprint trie as described in section 2.3.
Before describing our solution, we first recall some basic facts on the fingerprint 
trie that will be needed to understand our solution. First, recall the following two facts: 

1. Each node in the trie corresponds to a unique set and each set corresponds to a 
unique node. 

2. Each prefix of a fingerprint is also a fingerprint. 

Note also that the fingerprint trie implies an ordering on the characters of any given 
fingerprint represented in the trie. More precisely for a given node q, the characters 
of the corresponding fingerprint f q are ordered according to the order in which they 
appear as labels of the nodes in the path from the root to the node q. 

In our representation, the fingerprint trie will be represented in two different ways. 
This is why the space usage will be at least $2|\mathcal{F}|\log\sigma$ bits. The first representation will
permit a traversal of the fingerprint trie bottom-up (climb the trie) and the second 
one will permit a traversal of the fingerprint trie top-down (descend the trie). If the 
fingerprint is represented in the trie, then a bottom-up traversal will permit one to get 
the proper ordering on the fingerprint characters. Then, the presence of the fingerprint 
can be confirmed by a top-down traversal. Note that this second traversal can only 
return true if the fingerprint exists and is in the correct order represented in the trie. 
Therefore a top-down traversal will never return a false positive answer (it will never 
return true for a fingerprint not represented in the trie or for fingerprint represented 
in the trie but with a different ordering). Likewise, this top-down traversal will never 
return a false negative (it will always give a positive answer for an existing fingerprint) 
as it will be proven later that a bottom-up traversal will always return the correct 
ordering of the characters of an existing fingerprint and this correct ordering will thus 
be used to do a successful top-down traversal of the trie. 

We now give more details on our representation. First, notice that each set (fingerprint)
uniquely corresponds to a distinct node of the fingerprint trie. Let $f_q$ denote
the fingerprint associated with the node $q$. Let $\alpha(q_1,q_2)$ denote the character that
labels the edge which connects a node $q_1$ to its child $q_2$. Notice that by definition of the
fingerprint trie, for any node $q_2$ having a parent $q_1$, we have $f_{q_2} = f_{q_1} \cup \{\alpha(q_1,q_2)\}$.
That is, the fingerprint of the node $q_2$ is obtained by adding one character $\alpha(q_1,q_2)$ to
the fingerprint of its parent node $q_1$.

The solutions we propose are able to find whether a given query fingerprint $f$ is in the
set $\mathcal{F}$ in $O(|f|)$ time. A query for a fingerprint $f$ represented by a string which contains
all the characters of $f$ in an arbitrary order will work in three steps:

1. We query the bottom-up representation of the trie, which, when given the fingerprint
   $f$, returns a string $s$ of length $|f|$. This bottom-up representation relies on
   the use of the succinct function representation of Lemma 2. A detailed description of
   the step is in section 4.2.

2. We check whether the string $s$ is a permutation of the set $f$. That is, we check
   whether $s[i] \in f$ for each $i \in [0, |f|-1]$ and check also that all characters of $s$ are
   distinct. This step is done in time $O(|f|)$ with high probability using $O(|f| \log\sigma)$ bits
   of working space, or in deterministic time $O(k|f|)$ for any positive integer $k$ using the
   working space stated in Lemma 10. A detailed description of the step is in section 4.3.

3. The final step uses the succinct top-down representation of the trie to do a
   top-down traversal for the string $s$. This step permits checking whether the string
   $s$ exists in the trie representation in $O(|s|) = O(|f|)$ time. Notice that this is
   equivalent to checking that $f \in \mathcal{F}$. This is the case as by the previous step we have
   checked that $s$ is a permutation of $f$ and we know that the trie stores a unique
   string corresponding to each fingerprint in $\mathcal{F}$. A detailed description of the step is
   in section 4.4.

In the following three subsections we describe in more detail the data structures
used for each of the three steps. In subsection 4.5 we give the full picture of the query
and prove its correctness.



4.2 Backtracking Function (bottom-up trie representation) 

The first step is achieved through a data structure we call the backtracking function, 
which is in fact a bottom-up representation of the trie. This function associates to each 
fingerprint $f_i$ the last character in its string representation $s_i$. We will simply use a
static function that maps each set (fingerprint) to the last character in the character 
ordering. In other words whenever we have a fingerprint / corresponding to a node q 
in the fingerprint trie, we associate with / the character which labels the edge which 
connects q to its parent in the trie. That is, for each set we have a string representation 
that contains exactly the same characters as the set in a certain order. With each set 
we associate the last character in its string representation. 

It turns out that representing this backtracking function can be done using just
$(|\mathcal{F}| \log\sigma)(1 + o(1))$ bits of space, which is optimal. The generation of the backtracking
function from the set $\mathcal{F}$ can be done in optimal $O(|\mathcal{F}|)$ time. The generation is based
on the use of a polynomial hash function (the same used in the so-called Rabin-Karp
fingerprints [13]). The first step consists in a top-down traversal of the fingerprint trie.
Recall that each node represents a distinct fingerprint. Given a node $q$ with a parent
$p$, we note the fingerprint associated with $p$ by $f_p$ and the fingerprint associated with
$q$ by $f_q$. Then, if the edge which connects $p$ to $q$ is labeled by character $\alpha$, we will have
$f_q = f_p \cup \{\alpha\}$. So, during the top-down traversal of the trie we will compute a hash
value associated with each fingerprint. For that we will make use of the polynomial
hash functions family as described in section 2.1. More precisely, the hash functions
we will use are polynomials modulo a prime $P$ chosen such that $P \in [|\mathcal{F}|^2\sigma, 2|\mathcal{F}|^2\sigma]$.
Finding $P$ takes time $O((\log(|\mathcal{F}|^2\sigma))^c) = O((\log|\mathcal{F}| + \log\sigma)^c)$ for some constant $c$
(see section 2.1 for details on the algorithm used to find $P$).



Before beginning the top-down traversal of the trie, we will randomly choose a number
$r$ from the interval $[0, P-1]$. For any fingerprint $f_i$ having elements $\alpha_1, \alpha_2, \ldots, \alpha_{|f_i|}$,
we will associate the hash value computed using the formula
$H(f_i) = r^{f_\Sigma(\alpha_1)} + r^{f_\Sigma(\alpha_2)} + \ldots + r^{f_\Sigma(\alpha_{|f_i|})}$, where multiplications and additions are all done modulo $P$.
Now the generation of the hash values for all fingerprints is done in the following way:
we first associate the hash value 0 with the root node, which does not represent any
fingerprint. We note by $H_q$ the hash value associated with the node $q$ and by $H_p$ the hash
value associated with node $p$. From the definition it is evident that $H_q = H_p + r^{f_\Sigma(\alpha)}$
where $\alpha$ is the character which labels the edge connecting node $p$ to node $q$. Therefore,
during a top-down traversal of the trie, we can compute the hash value for each fingerprint
in constant time given the hash value of its parent node. Once we have generated
the $|\mathcal{F}|$ hash values corresponding to the $|\mathcal{F}|$ fingerprints, we will check whether all
hash values are distinct. According to Lemma 1 we deduce that this is the case with
probability at least 1/2. If this is not the case, we will choose a new value $r$ and
recompute the hash values in the same way during a top-down traversal of the trie.
As in expectation we will do $O(1)$ trials, each trial taking time $O(|\mathcal{F}|)$, we deduce
that the total expected time is $O(|\mathcal{F}|)$.
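The generation of the hash values can be illustrated as follows. This is a minimal sketch under several assumptions: the trie is a plain nested dictionary (`{character: subtrie}`), `f_sigma` is a dictionary playing the role of $f_\Sigma$, the prime $P$ is found by naive trial division rather than by the algorithm of section 2.1, and the names `is_prime` and `hash_fingerprints` are hypothetical.

```python
import random


def is_prime(m):
    return m >= 2 and all(m % d for d in range(2, int(m ** 0.5) + 1))


def hash_fingerprints(trie, f_sigma, seed=0):
    """Return (P, r, hashes), hashes mapping each fingerprint to H(f)."""
    def count(node):
        return sum(1 + count(child) for child in node.values())

    n_fp = count(trie)                       # |F|, assumed to be at least 1
    lo = n_fp * n_fp * len(f_sigma)
    P = next(m for m in range(lo, 2 * lo + 1) if is_prime(m))
    rng = random.Random(seed)
    while True:                              # expected O(1) trials
        r = rng.randrange(P)
        hashes = {}
        stack = [(trie, frozenset(), 0)]     # the root carries hash value 0
        while stack:
            node, f_p, h_p = stack.pop()
            for a, child in node.items():
                h_q = (h_p + pow(r, f_sigma[a], P)) % P   # H_q = H_p + r^f(a)
                f_q = f_p | {a}
                hashes[f_q] = h_q
                stack.append((child, f_q, h_q))
        if len(set(hashes.values())) == len(hashes):
            return P, r, hashes


# Toy fingerprint trie storing {a}, {a,b}, {a,b,c}, {b}, {c}.
trie = {"a": {"b": {"c": {}}}, "b": {}, "c": {}}
f_sigma = {"a": 0, "b": 1, "c": 2}
P, r, hashes = hash_fingerprints(trie, f_sigma)
```

Since the hash is a sum over the characters, it does not depend on the order in which a query lists them, which is what later allows a query given in arbitrary order to be hashed directly.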

Once we have successfully mapped all the keys to distinct hash values in the range $[0, P-1]$,
we will store a static function using Lemma 2 which for each fingerprint $f_i$ will associate
the character $f_\Sigma(\alpha_i)$ (where $\alpha_i$ is the last character in $f_i$) to the hash value $H(f_i)$. The
space used by the static function will clearly be $|\mathcal{F}|(\log\sigma)(1 + o(1))$ bits.



4.3 Deterministic and Probabilistic Set Equality Testing 

We now describe a method to test for set equality. This is step 2 in our query algorithm.
Given two strings $s_1$ and $s_2$ where $|s_1| = |s_2|$, we wish to test whether
the two strings are permutations of the same set³. That is, we are asking whether we can
obtain the string $s_1$ by permuting the characters of the string $s_2$. We
propose two solutions for this problem. The first one is randomized while the second
one is deterministic. The two solutions are folklore, but we describe them here for
completeness.

Randomized Method The randomized method works in the following way: we use
a dynamic perfect hash table [8] (or any other efficient hash table implementation) in
which we insert all the characters of the string $s_1$. This takes time $O(|s_1|)$ with high
probability and uses $O(|s_1| \log\sigma)$ bits of space⁴.

During the insertion, we can easily check that the characters of $s_1$ are all distinct
by checking that every character of $s_1$ is not present in the table at the time of its
insertion. In the hash table, we associate a bit with each key and we initialize the bit
to zero. Now, we process the string $s_2$. For each character $\alpha$ of $s_2$ we query the perfect
hash table for the character $\alpha$. In case we do find it, we mark the bit associated with
it. After we have processed all characters of $s_2$, we check if all the bits associated with
characters of $s_1$ are now set to one. If this is the case, we conclude that $s_2$ and $s_1$ are
permutations of the same set.
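The randomized test can be sketched in Python as follows. A built-in `dict` is used here in place of the dynamic perfect hash table of [8], so the sketch illustrates the logic only and not the stated space bound; the function name `same_set_hashed` is hypothetical.

```python
def same_set_hashed(s1, s2):
    """True iff s1 and s2 are permutations of the same set of characters."""
    if len(s1) != len(s2):
        return False
    marked = {}                    # character of s1 -> bit, initially zero
    for c in s1:
        if c in marked:            # c occurs twice in s1
            return False
        marked[c] = False
    for c in s2:
        if c in marked:
            marked[c] = True       # mark the bit of a character found in s1
    # All bits set and |s1| = |s2| force s2 to be a permutation of s1.
    return all(marked.values())
```

Because the two strings have the same length, having every bit set implies that $s_2$ contains each of the $|s_1|$ distinct characters exactly once, so no separate check for repeated characters in $s_2$ is needed.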

3 To declare that two strings are equal we require that the two strings are permuta- 
tions. That is, the characters of each string are all distinct. 

4 A linear-space hash table needs $O(\log|U|)$ bits per element where $U$ is the universe. In
our case $U = \Sigma$ and thus $|U| = \sigma$.



Clearly this randomized method uses $O(|s_1|)$ words of space, that is $O(|s_1| \log\sigma)$
bits of space, which is optimal up to a constant factor, as we also need $\log\sigma$ bits
to represent each character of $s_1$.

Deterministic Method We now describe a deterministic method which can be used
to do equality testing. The basic method needs $\sigma$ bits of working space for queries and
checks set equality in optimal time $O(|s_1|)$. A more sophisticated method could use
space $O(\sigma^{1/k} \log\sigma + |s_1| \log\sigma)$ bits and answers set equality in time $O(k|s_1|)$ for any
integer $k$ such that $k > 1$. In the basic method, we will simply use a bitvector $B$ of $\sigma$
bits. At the beginning all the bits in $B$ are set to zero, and we require that they are
reset to zero after each equality test.

The equality test works in the following way: we first process the string $s_1$. For
each $i$ in $[0, |s_1|-1]$, we set $c = f_\Sigma(s_1[i])$ and then set $B[c] = 1$. Before setting $B[c] = 1$,
we check that $B[c] \neq 1$ and thus that the character $s_1[i]$ does not occur twice in $s_1$.

We now traverse the string $s_2$. For each $i$ in $[0, |s_2|-1]$, we set $c = f_\Sigma(s_2[i])$ and
check that $B[c] = 1$. If this is the case, then we set $B[c] = 0$; otherwise, we declare
that $s_1$ and $s_2$ are two distinct strings. Setting $B[c]$ to zero is necessary to ensure that
the characters of $s_2$ are all distinct.

It is easy to see that the above procedure correctly computes the equality of $s_1$ and
$s_2$. In the first phase we have set all the $|s_1|$ distinct bits corresponding to characters
of $s_1$. In the second phase, we check that the bits corresponding to characters of $s_2$ are
all distinct and all set, which can only be the case if those bits are precisely the $|s_1|$
bits corresponding to characters of $s_1$.

At the end of checking, if the two strings are equal, then all the bits of $B$ are set
to zero, so that $B$ is ready for the next query. If the two strings are not equal, then we
need to traverse the string $s_1$ and clear the bits of $B$ which were set to one when $s_1$
was first traversed (we set $B[c] = 0$ for every $c = f_\Sigma(s_1[i])$).
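The basic deterministic test, including the cleanup that leaves $B$ all zero, can be sketched as follows. The bitvector is modeled by a Python list of 0/1 integers and `f_sigma` by a dictionary from characters to $[0, \sigma-1]$; both are stand-ins chosen for illustration, and the function name `same_set_bitvector` is hypothetical.

```python
def same_set_bitvector(s1, s2, f_sigma, B):
    """Set equality with a shared bitvector B that is all zero on entry and exit."""
    if len(s1) != len(s2):
        return False
    equal = True
    for ch in s1:                      # first phase: set the bits of s1
        c = f_sigma[ch]
        if B[c] == 1:                  # ch occurs twice in s1
            equal = False
            break
        B[c] = 1
    if equal:
        for ch in s2:                  # second phase: consume the bits with s2
            c = f_sigma[ch]
            if B[c] != 1:
                equal = False
                break
            B[c] = 0
    if not equal:
        for ch in s1:                  # cleanup: clear whatever s1 may have set
            B[f_sigma[ch]] = 0
    return equal


f_sigma = {"a": 0, "b": 1, "c": 2, "d": 3}
B = [0] * len(f_sigma)
results = [
    same_set_bitvector("abc", "cab", f_sigma, B),
    same_set_bitvector("abc", "abd", f_sigma, B),
    same_set_bitvector("aab", "aba", f_sigma, B),
]
```

Reusing the same `B` across the three calls shows the point of the cleanup step: each call starts from an all-zero bitvector without paying $O(\sigma)$ to reinitialize it.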

Lemma 9. We can do equality testing between two strings $s_1$ and $s_2$ over an alphabet
of size $\sigma$ in time $O(|s_1|)$ using $\sigma$ bits of working space.

We now describe the more sophisticated method. We only describe how to achieve
$O(\sqrt{\sigma})$ space. The generalization to $O(\sigma^{1/k})$ space for $k > 2$ can easily be deduced
from the case $k = 2$.

The method works in the following way: we first partition the characters of $s_1$
according to their $\lceil \log\sigma/2 \rceil$ most significant bits. We also do the same partitioning for
the characters of $s_2$. Finally, we compare all the pairs of partitions (one from $s_1$ and
one from $s_2$) in which the characters share the same $\lceil \log\sigma/2 \rceil$ most significant bits.

We now give the details of the implementation. We use a table $T_1$ with $2^{\lceil \log\sigma/2 \rceil} < 2\sqrt{\sigma}$
cells where each cell $T_1[i]$ contains a pointer (denoted by $T_1[i].P$) to a list of
characters. At the beginning we suppose that every $T_1[i].P$ is initialized to null, meaning
that all the lists are empty. We also use a list $L_1$ which stores a list of non-empty cells
(cells with non-null pointers) of $T_1$. At the beginning we process the characters of $s_1$
one by one and for each character $\alpha_i$ do the following steps:

1. Compute $j = MSB(f_\Sigma(\alpha_i))$, the $\lceil \log\sigma/2 \rceil$ most significant bits of $f_\Sigma(\alpha_i)$.

2. Save in variable oldP the old value of $T_1[j].P$.

3. Add $\alpha_i$ to the list $T_1[j].P$.

4. If oldP equals null, add $j$ to the list $L_1$. That is, the list $T_1[j].P$ which was previously
   empty is added to $L_1$ as now it is non-empty.



At the end of the processing, we do a second step in which we use a second table
$T_2$ similar to $T_1$, where each cell $T_2[i]$ has a field $T_2[i].P$. In this step we process the
characters of $s_2$ one by one and for each character $\alpha_i$, with $j = MSB(f_\Sigma(\alpha_i))$, we add
$\alpha_i$ to the list $T_2[j].P$. In the third step, we use two lists $L'_1$ and $L'_2$, initially empty.
We take the list $L_1$ and for each element $j$ in the list do the following:

1. Add all elements of the list $T_1[j].P$ at the end of the list $L'_1$.

2. Add all elements of the list $T_2[j].P$ at the end of the list $L'_2$.

At the end of the third step we are left with two lists $L'_1$ and $L'_2$ which are sorted
according to the list $L_1$. That is, in each of the two lists we have first all characters
whose $\lceil \log\sigma/2 \rceil$ most significant bits are equal to $L_1[0]$, followed by all characters whose
most significant bits are equal to $L_1[1]$, etc. Thus, to finish the equality testing it suffices
for every $j$ in the list $L_1$ to do the following:

1. First advance in $L'_1$ in order to find $R_{j1}$, the longest run of $t_1$ characters in $L'_1$
   whose $\lceil \log\sigma/2 \rceil$ most significant bits are equal to $j$.

2. Similarly, advance in $L'_2$ to identify $R_{j2}$, the longest run of $t_2$ characters in $L'_2$ whose
   $\lceil \log\sigma/2 \rceil$ most significant bits are equal to $j$.

3. Check that $t_1 = t_2$. If this is not the case, immediately declare that $s_1$ is distinct
   from $s_2$.

4. Otherwise we check for the equality of the characters in $R_{j1}$ and $R_{j2}$. To this end
   we already know that they have the same $\lceil \log\sigma/2 \rceil$ most significant bits, so that
   we only need to do equality testing for the $\lfloor \log\sigma/2 \rfloor$ least significant bits between
   characters of $R_{j1}$ and $R_{j2}$, which can be done using the procedure of Lemma 9.
   This will take time $O(t_1)$ and needs to use just a bitvector of size $2^{\lfloor \log\sigma/2 \rfloor} \le \sqrt{\sigma}$
   bits.

If all the iterations are completed, we immediately deduce that the two sets $s_1$ and
$s_2$ are equal. Concerning the running time, it is clear that the above procedure runs
in time $O(|s_1|)$. Every element of $L'_1$ and $L'_2$ is only traversed twice, the first time for
determining the length of the runs and the second time for determining the equality
between elements of two runs. Each time an element is traversed, only a constant
number of operations are carried out.

We now analyze the space usage. The total space needed to store the different lists
will be upper bounded by $O(|s_1| \log\sigma)$. The table $T_1$ will use space $O(\sqrt{\sigma} \log\sigma)$ bits,
while the bitvector $B$ will use space $O(\sqrt{\sigma})$ bits.
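The case $k = 2$ can be sketched in Python as below. Several simplifications are assumptions of this sketch and not part of the method as described: the buckets of $T_1$ and $T_2$ are compared directly instead of being first concatenated into $L'_1$ and $L'_2$, the tables are allocated afresh on each call (which costs $O(\sqrt{\sigma})$ time per call, whereas the method above keeps them between queries), and `f_sigma` is assumed to map characters to $[0, \sigma-1]$. The names `same_set_sqrt` and `_same_low` are hypothetical.

```python
def _same_low(run1, run2, low, B):
    """Bitvector test (as in Lemma 9) on the low-order bits of two equal-length runs."""
    ok = True
    for ch in run1:
        if B[low(ch)]:
            ok = False
            break
        B[low(ch)] = 1
    if ok:
        for ch in run2:
            if not B[low(ch)]:
                ok = False
                break
            B[low(ch)] = 0
    if not ok:
        for ch in run1:                # leave B all zero for the next bucket
            B[low(ch)] = 0
    return ok


def same_set_sqrt(s1, s2, f_sigma, sigma):
    if len(s1) != len(s2):
        return False
    bits = max(1, (sigma - 1).bit_length())      # ceil(log2(sigma)), at least 1
    low_bits = bits // 2                         # floor(bits / 2) low-order bits
    high = lambda ch: f_sigma[ch] >> low_bits
    low = lambda ch: f_sigma[ch] & ((1 << low_bits) - 1)
    cells = 1 << (bits - low_bits)               # 2^ceil(bits / 2) cells
    T1, T2, L1 = [None] * cells, [None] * cells, []
    for ch in s1:
        j = high(ch)
        if T1[j] is None:                        # bucket j becomes non-empty
            T1[j] = []
            L1.append(j)
        T1[j].append(ch)
    for ch in s2:
        j = high(ch)
        if T2[j] is None:
            T2[j] = []
        T2[j].append(ch)
    B = [0] * (1 << low_bits)
    for j in L1:
        run1, run2 = T1[j], T2[j] or []
        if len(run1) != len(run2) or not _same_low(run1, run2, low, B):
            return False
    return True


f_sigma = {chr(ord("a") + i): i for i in range(16)}   # sigma = 16
```

Only the buckets listed in $L_1$ are visited. Since $|s_1| = |s_2|$ and every visited bucket has matching sizes, no character of $s_2$ can be left in a bucket that $s_1$ does not use.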

The above algorithm can be easily generalized to use space $O(\sigma^{1/k} \log\sigma)$. For that
it suffices to do the partitioning of the characters of $s_1$ and $s_2$ in $k-1$ phases. The
$\log\sigma$ bits of the characters are divided into slices of size about $\log\sigma/k$ bits each. Then
in each phase we partition the keys according to one of the slices, starting from the
most significant slice and moving to the least significant. After $k-1$ partitionings we will be left
with partitions which only differ in their (at most) $\log\sigma/k$ least significant bits. In the
final phase, pairs of partitions (one from $s_1$ and one from $s_2$) can easily be matched as
was done above using Lemma 9.

Lemma 10. Given any two strings $s_1$ and $s_2$ of equal length, testing for the equality
of the multisets induced by $s_1$ and $s_2$ can be done:

1. In expected $O(|s_1|)$ time with high probability using only $O(|s_1| \log\sigma)$ bits of space.

2. In worst case $O(k|s_1|)$ time using $O(\sigma^{1/k} \log\sigma)$ bits of space.



4.4 Succinct Trie Representation (top-down trie representation) 



The third step of a query uses a top-down trie representation which we describe in this
section. First of all, a trie $Tr$ of size $N$ over an alphabet of size $\sigma$ can be represented compactly
to use optimal space $N(\log\sigma + \log_2 e + o(1))$ using the representation described in [17],
permitting many navigation operations on the trie in constant time. In particular, a
top-down traversal of the trie for a string $s$ can be done in time $O(|s|)$ by using $O(1)$
time at each step $i$ of the traversal, which consists in finding the child labeled with
character $s[i]$. Given a string $s$, we can determine whether $s$ belongs to the set $S$ of strings
stored in the trie in time $O(|s|)$, by doing a top-down traversal of the trie. Thus, given
the set $\mathcal{F}$ of fingerprints in a trie of size $|\mathcal{F}|$, we can succinctly encode the trie
representing the set $\mathcal{F}$ in time $O(|\mathcal{F}|)$ so that the trie uses space of
$|\mathcal{F}|(\log\sigma + \log_2 e + o(1))$ bits. A top-down traversal of
the trie will take $O(1)$ time per traversed node. Thus given a fingerprint $f$ in the
correct order, we can check whether it is present in the set $\mathcal{F}$ by doing a top-down
traversal of the succinctly encoded trie representing the set $\mathcal{F}$.



4.5 Putting Things Together 

We are now ready to describe the full details of the queries on our data structures 
described in the previous subsections. A query for a fingerprint / = {01,02, . . . ,«|/|} 
is given as a string s/ of characters consisting in the concatenation of the characters 
Qi, ct2, . ■ . , am. The characters are not necessarily lexicographically sorted. The query 
involves the following steps : 

1. Compute the hash value: 

m)= E r/ " (Ql) 

l<i<|/| 

This operation takes time 0(|/|), as it involves only 0(|/|) arithmetic operations. 
In the following we note / by f\f\ and note H(fj) by Hj. 

2. Probe the backtracking function using the hash value H\f\ — H(f), retrieving 
a character /3j (actually retrieving fs(Pj) then use the reverse mapping f^. 1 to 
get /3j). Then we do |/| — 1 steps, computing for each j G [1, |/| — 1] the hash 
value Hj-i = Hj — r^ s ^^ and probe the backtracking function using the hash 
value Hj-i retrieving the character Pj-i. At the end of the |/| — 1 steps we will 
have obtained a sequence s't = /3|/|,,S|/|_i, . . . of characters. Suppose that 
/ G T. When queried with the hash value Hj, the backtracking function would 
return in this case the last character of the fingerprint representation of /. Then 
fj-i — fj/{Pj} would also represent another fingerprint from T. More generally 
we will have fj £ T for every j £ [1, |/|] with fj — {/3i, /?2, • ■ • , Pj} 

3. The third step is to apply the method described in Section 4.3 in order to determine
whether the set of characters in $s'_f$ equals the set of characters in $f$. If the two sets
differ, we immediately conclude that $f \notin \mathcal{F}$.

4. Finally we do a top-down traversal of the succinctly encoded trie described in
Section 4.4 for the string $s'_f$. If the traversal fails before reaching a leaf, we
immediately conclude that $f \notin \mathcal{F}$; otherwise we conclude that $f \in \mathcal{F}$.
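The four query steps can be sketched on a toy model. Everything below is an assumption made for illustration: the backtracking function is a plain dictionary from hash value to last character, the trie is a prefix-closed set of path strings, `code` stands in for the mapping $f_s$, the base `R` is fixed rather than random, and the fingerprints are enumerated by brute force. None of this reflects the space bounds of the actual structures.

```python
P = (1 << 61) - 1   # Mersenne prime; hypothetical modulus for this sketch
R = 1_000_003       # hypothetical fixed base (the text draws r at random)


def code(c):
    # Stand-in for the character-to-integer mapping f_s (assumption: code point).
    return ord(c)


def H(chars):
    # Step 1: polynomial hash of a set of characters.
    return sum(pow(R, code(c), P) for c in set(chars)) % P


def fingerprints(s):
    """All fingerprints of s, by brute force (illustration only)."""
    out = set()
    for i in range(len(s)):
        seen = set()
        for c in s[i:]:
            seen.add(c)
            out.add(frozenset(seen))
    return out


def build(s):
    """Return (backtrack, trie): hash -> last character, and the set of trie paths."""
    rep = {frozenset(): ""}
    for f in sorted(fingerprints(s), key=lambda f: (len(f), sorted(f))):
        # Every fingerprint of size j extends some fingerprint of size j - 1.
        beta = next(b for b in sorted(f) if f - {b} in rep)
        rep[f] = rep[f - {beta}] + beta
    backtrack = {H(f): path[-1] for f, path in rep.items() if f}
    assert len(backtrack) == len(rep) - 1, "hash collision; pick another R"
    return backtrack, set(rep.values())


def query(chars, backtrack, trie):
    f = set(chars)
    h = H(f)                                  # step 1
    betas = []
    for _ in range(len(f)):                   # step 2: backtrack |f| characters
        beta = backtrack.get(h)
        if beta is None:                      # a static function would return an
            return False                      # arbitrary character; steps 3-4 reject it
        betas.append(beta)
        h = (h - pow(R, code(beta), P)) % P
    if set(betas) != f:                       # step 3: same character set?
        return False
    return "".join(reversed(betas)) in trie   # step 4: path beta_1..beta_|f| in trie?
```

For the string `abac`, the fingerprints are {a}, {b}, {c}, {a,b}, {a,c} and {a,b,c}, so `query("cab", *build("abac"))` succeeds while `query("bc", *build("abac"))` is rejected.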

Now we can describe more precisely what is happening inside the data structure. We
have to analyze two cases, the case $f \in \mathcal{F}$ and the case $f \notin \mathcal{F}$. For that we first prove
the following lemmata:



Lemma 11. Let $f \in \mathcal{F}$. Then

1. for each $j \in [1, |f|]$, $f_j \in \mathcal{F}$;

2. the string $s'_f$ is stored in the fingerprint trie.

Proof. The proof of fact 1 is by induction: $f$ is a valid fingerprint (by assumption),
which means that the backtracking function returns the last character $\beta_j$ in the trie
representation of $f_j$. Then we know that there exists some $f_{j-1} \in \mathcal{F}$ such that
$f_{j-1} \cup \{\beta_j\} \in \mathcal{F}$. The base case of the induction is $j = 1$ (the fingerprint consists of a single
character $\beta_1$), in which case we clearly have a child of the fingerprint root labeled with
character $\beta_1$.

The proof of fact 2 can also be obtained by induction. Assume that the assertion
is true for a fingerprint $f_{j-1}$ of size $j - 1$, i.e. the sequence
$s'_{f_{j-1}} = \beta_1, \beta_2, \ldots, \beta_{j-1}$ of distinct symbols forms a permutation of $f_{j-1}$.
We prove it for a fingerprint $f_j$ of size $j$. We know that the backtracking function returns a character $\beta_j$
which is the last character of the representation of $f_j$ in the fingerprint trie and that
there exists a fingerprint $f_{j-1}$ of size $j - 1$ such that $f_{j-1} \cup \{\beta_j\} \in \mathcal{F}$. Since
fact 2 holds for $f_{j-1}$, adding the character $\beta_j \notin f_{j-1}$ to the
sequence yields a permutation of $f_j$.

From there we can get the following lemma: 

Lemma 12. If $f \in \mathcal{F}$ then the query successfully detects that $f \in \mathcal{F}$ and returns a
positive answer.

Proof. By assumption $f \in \mathcal{F}$, which means by fact 2 of Lemma 11 that step 2 returns
a sequence $s'_f$ which is a permutation of the set $f$. That means that step 3 will return
a positive answer. It remains to be proven that step 4 is also successful. By
fact 1 of Lemma 11, step 4 succeeds, as it traverses the fingerprint trie
top-down and at each step reaches a valid fingerprint $f_j$.

Lemma 13. Assuming that $f \notin \mathcal{F}$, either step 3 or step 4 will successfully detect that
$f \notin \mathcal{F}$ and the query returns a negative answer.

Proof. The proof is by contradiction. Suppose that step 4 has concluded that $f \in \mathcal{F}$.
Then step 3 tells us that we have a sequence of $|f|$ characters $s'_f = \beta_1, \beta_2, \ldots, \beta_{|f|}$
which is a permutation of $f$, and by successfully traversing the trie in
step 4 we deduce that $f \in \mathcal{F}$, which contradicts the premise that $f \notin \mathcal{F}$.

Thus, we get the following theorem: 

Theorem 4. The set $\mathcal{F}$ of fingerprints of a sequence $s = s_1..s_n$ can be represented
using a data structure that occupies $|\mathcal{F}|(2\log\sigma + \log_2 e)(1 + o(1))$ bits. Given a set of
characters $f$, the data structure is able to determine whether $f \in \mathcal{F}$ (existential queries)
in time $O(|f|)$.

We can also use the data structure to answer report queries. However, in this case,
because of the need to store pointers to occurrences, the representation will no longer
be succinct (a pointer needs $\Omega(\log n)$ bits to be represented). We note that for each
fingerprint, we can just store the list of maximal locations in the sequence using $2\log n$
bits for each element, giving a total of $O(|\mathcal{L}|\log n)$ bits. However, a more space-efficient
approach is to use the suffix tree and for each fingerprint store a list of pointers to
named copies in the suffix tree. This reduces the space to $O((n + |\mathcal{L}_C|)\log n)$ bits.
Moreover, reporting the locations of the $occ$ named copies from the suffix tree takes
optimal $O(occ)$ time as it consists in traversing a subtree with at most $occ$ leaves and
$occ - 1$ internal nodes.

Theorem 5. Given a sequence $s = s_1..s_n$ of characters, we can in time $O(n + |\mathcal{L}_C|\log\sigma)$
build a data structure that occupies $O((n + |\mathcal{L}_C|)\log n)$ bits of space such that, given a
fingerprint $f \in \mathcal{F}$, the data structure is able to report all the $occ$ maximal locations in
$s$ corresponding to $f$ in time $O(|f| + occ)$.

5 Identifying Fingerprints in Less Space 

The result of Theorem 3 names all fingerprints of $s$ in time $O(n + |\mathcal{L}_C|\log\sigma)$ while using
$O((|\mathcal{L}_C| + |\mathcal{F}|\log\sigma)\log n)$ bits of working space during the construction. The value $|\mathcal{L}_C|$ in
the working space can dominate the value $|\mathcal{F}|\log\sigma$ when $|\mathcal{F}| \ll |\mathcal{L}_C|$. When we need to
build a data structure for report queries, the value $|\mathcal{L}_C|$ also appears in the final
space requirement, and hence its presence in the construction space is unavoidable. However,
when we only need to answer existential queries, the final data structure
uses only $O(|\mathcal{F}|\log\sigma)$ bits. In this case it would be desirable to reduce the
construction space as well. In this section, we show how to compute the set $\mathcal{F}$ in time
$O(|\mathcal{L}|\log\sigma)$, but using only $O(|\mathcal{F}|\log\sigma\log n)$ bits of space.

The original naming algorithm of [1] is convenient for our purpose as it does the
naming online without the need to carry the list of fingerprint changes (which is essentially
equivalent to $\mathcal{L}$) until the end of the construction. The complexity of the
algorithm of [1] is $O(n\sigma\log n\log\sigma)$. The $\log n$ factor comes from the
use of a binary search tree which is responsible for the following task: given a pair of
names (sub_name_0, sub_name_1) at level $i$, find whether there is a unique name up_name
at level $i + 1$ associated with the pair and, if not, create a new unique name up_name, associate
it with the pair (sub_name_0, sub_name_1) and add it to the binary search tree. The
complexity of the naming algorithm was improved in [14, 15] from $O(n\sigma\log n\log\sigma)$ to
just $O(|\mathcal{L}|\log\sigma)$ in the following way.

1. Notice that the naming has to deal only with $|\mathcal{L}|$ fingerprint changes instead of $n\sigma$.
This reduces the factor $n\sigma$ to $|\mathcal{L}|$.

2. Defer the naming process until all the fingerprint changes have been recorded.
Then, using radix sort, the time needed to give unique names at level $i + 1$ to
pairs of names from level $i$ is reduced to constant time per pair. This dispenses
with the binary search tree and reduces the factor $\log n$ to just 1.

This is the approach used in Theorem 3 and described in Section 3.

Our approach to improve [1] is to notice that the binary search tree can be replaced
with any hash table implementation, which changes the time per operation
from worst-case $O(\log n)$ to randomized expected $O(1)$. With this change the running time
reduces to expected $O(|\mathcal{L}|\log\sigma)$, but contrary to Theorem 3 the construction space remains
as small as in [1], as we do not need to record the fingerprint changes during the construction
process. More precisely, during the naming process we only need to maintain the at
most $|\mathcal{F}|\log\sigma$ names (each fingerprint might incur at most $\log\sigma$ names, one name at
each level) which have been attributed so far. These names are recorded in a hash
table which uses $O(|\mathcal{F}|\log\sigma\log n)$ bits of space.
Thus, we have proven the following theorem: 



Theorem 6. The set $\mathcal{F}$ of fingerprints of a sequence $s = s_1..s_n$ can be computed in
expected time $O(n + |\mathcal{L}|\log\sigma)$ using $O((n + |\mathcal{F}|\log\sigma)\log n)$ bits of working space.
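The pair-naming step with a hash table in place of the binary search tree can be sketched as follows. This is only an illustration under stated assumptions: a fingerprint is modeled as a bit vector over the alphabet, names are built bottom-up in a complete binary tree over that vector, and each level has its own Python dictionary mapping a pair of child names to a parent name. The scan restarts at every position of $s$ (brute force), so the sketch does not achieve the bounds of Theorem 6; the names `name_fingerprints`, `tables` and `tree` are hypothetical.

```python
def name_fingerprints(s, alphabet):
    """Map each fingerprint of s (as a frozenset) to the name of the tree root."""
    sigma = 1
    while sigma < len(alphabet):
        sigma *= 2                       # pad the alphabet size to a power of two
    leaf = {c: sigma + k for k, c in enumerate(alphabet)}  # heap-style leaf index
    levels = sigma.bit_length() - 1      # log2(sigma)
    # tables[l] maps a pair of names at level l - 1 to a unique name at level l.
    # The all-empty pair is registered first, so an empty subtree is named 0.
    tables = [{(0, 0): 0} for _ in range(levels + 1)]
    names = {}
    for i in range(len(s)):
        tree = [0] * (2 * sigma)         # every node starts as "empty"
        seen = set()
        for c in s[i:]:
            if c in seen:
                continue                 # fingerprint unchanged, nothing to rename
            seen.add(c)
            node, level = leaf[c], 0
            tree[node] = 1               # leaf name 1 means "character present"
            while node > 1:              # rename the log2(sigma) ancestors
                node //= 2
                level += 1
                pair = (tree[2 * node], tree[2 * node + 1])
                table = tables[level]
                # expected O(1): look the pair up, or give it a fresh name
                tree[node] = table.setdefault(pair, len(table))
            names[frozenset(seen)] = tree[1]
    return names
```

Each fingerprint change touches one leaf and its $\log\sigma$ ancestors, with one dictionary operation per level; two fingerprints receive the same root name exactly when their bit vectors are equal.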

6 Randomized Identification Using a Monte Carlo Algorithm

We now briefly sketch our construction algorithm that constructs the set of fingerprints
$\mathcal{F}$ of the sequence $s$, using only $O(|\mathcal{F}|\log n)$ bits ($O(|\mathcal{F}|)$ words) of temporary space
and running in time $O(|\mathcal{L}|)$. While this approach might fail with an extremely small
probability (the approach is said to be Monte Carlo, or MC for short), it might still
be useful in case one wishes to get approximate statistics on fingerprints: counting the
total number of distinct fingerprints, or counting the total number of strings having a
given fingerprint, etc.

To name the fingerprints we use hash values of size $O(\log n)$ bits. The hash
values are computed using polynomial hash functions as described in Section 2.1.

As in the previous section, the naming is done online: we do not need to
store the fingerprint changes during the naming process. Unlike the method described
in the previous section, the fingerprint names are not assigned deterministically,
but are instead assigned using hash values, which could collide with extremely
small probability. More specifically, in order to identify the existence of a fingerprint
we use the polynomial hash functions described in Section 2.1 on the whole
fingerprint. The polynomial hash function is computed modulo $P$, where $P$ is a
prime selected such that $P > n^c n^2 \sigma^3$. The chosen value of $P$ ensures that each
fingerprint is mapped to a distinct value with probability at least $1 - n^{-c}$. This can
easily be seen: we have $|\mathcal{F}| \le n\sigma$, which implies that $|\mathcal{F}|^2 \le n^2\sigma^2$. Given that the
polynomials are of degree at most $\sigma$, we can deduce that the probability of collision is
at most $n^2\sigma^3 / P < n^{-c}$.

We now describe our algorithm in more detail. We assume that a set $S$ of fingerprints
can be represented as a list $L = (\alpha_1, \alpha_2, \ldots, \alpha_p)$ of distinct characters such that
$S = \{f_1, f_2, \ldots, f_p\}$ where $f_i = \bigcup_{1 \le j \le i}\{\alpha_j\}$. We randomly choose a number $r \in [0, P]$ and
the random hash function $H_r$ will be such that:

$$H_r(f_i) = \sum_{1 \le j \le i} r^{f_s(\alpha_j)}$$

The number $H_r(f_i)$ will be the unique name associated with the fingerprint $f_i$. Now
observe that $H_r(f_i) = H_r(f_{i-1}) + r^{f_s(\alpha_i)}$. Thus computing the label of $f_i$ can be done
online using a constant number of arithmetic operations based on $\alpha_i$ and $H_r(f_{i-1})$. In
order to maintain the set of already processed fingerprints, we use a dynamic hash
table (for example using the MC real-time dynamic hashing method described in [7])
that records the names of already processed fingerprints. Each time we generate the
name of the fingerprint associated with a given maximal location, we probe the dynamic
hash table to see if that name already exists and, if not, add it to the hash table. If we
also need to maintain the set of maximal locations along with the set of fingerprints,
we just associate a list of maximal locations with each fingerprint and store that list as
satellite data associated with the fingerprint name stored in the hash table. When the
name of the fingerprint associated with a maximal location already exists in the hash
table, this maximal location is added to the list of maximal locations associated with
the fingerprint name in the hash table. If the fingerprint name did not already exist
in the hash table, we add the name to the hash table and associate with it a list of maximal
locations which contains only the maximal location corresponding to the newly added
fingerprint.

In conclusion, we have proven the following theorem: 

Theorem 7. The set $\mathcal{F}$ of fingerprints of a sequence $s = s_1..s_n$ can be probabilistically
computed in time $O(n + |\mathcal{L}|)$ using $O((n + |\mathcal{F}|)\log n)$ bits of working space. Moreover,
the set of maximal locations $\mathcal{L}$ can be probabilistically determined in time $O(n + |\mathcal{L}|)$
using $O((n + |\mathcal{L}|)\log n)$ bits of working space. The error probability can be made
$O(n^{-c})$ for any constant $c$.
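The incremental hash naming of this section can be illustrated with a short sketch. It is a brute-force enumeration over all substrings, so it does not reach the running time of Theorem 7; its only purpose is to show the update $H_r(f_i) = H_r(f_{i-1}) + r^{f_s(\alpha_i)}$ and the hash table with lists of maximal locations as satellite data. The fixed modulus `P`, the character numbering (rank in the sorted alphabet, starting at 1) and the function name `mc_fingerprint_names` are assumptions of this example, not choices made in the text.

```python
import random


def mc_fingerprint_names(s, r=None, P=(1 << 61) - 1):
    """Return {hash name: [maximal locations (i, j)]}, using 0-based positions."""
    if r is None:
        r = random.randrange(P)          # random base; collisions are possible
    alphabet = sorted(set(s))
    # Precompute r^{f_s(c)} mod P, with f_s(c) in [1, sigma] (assumed numbering).
    power = {c: pow(r, k + 1, P) for k, c in enumerate(alphabet)}
    table = {}
    n = len(s)
    for i in range(n):
        seen = set()
        h = 0                            # name of the empty character set
        for j in range(i, n):
            c = s[j]
            if c not in seen:
                seen.add(c)
                h = (h + power[c]) % P   # constant-time update of the name
            # (i, j) is maximal if neither neighbour belongs to the fingerprint.
            left_ok = i == 0 or s[i - 1] not in seen
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                table.setdefault(h, []).append((i, j))
    return table
```

For `s = "abac"` the sketch finds seven maximal locations spread over six fingerprint names, provided the chosen `r` causes no collision; the fingerprint {a} owns the two locations (0, 0) and (2, 2).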